# Procedia Computer Science 00 (2012) Trieu Minh Nhut Le, Jinli Cao, and Zhen He.


Procedia Computer Science 00 (2012) 1–21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

## Transcription

prediction. This is known as probabilistic data. For example, assume that the data in Table 1 have been collected and analysed statistically from historical data resources [12]. Each tuple represents an investment project of USD \$100 to produce a specific product (Product ID). Each investment yields an estimated profit with an associated probability (Probability). In tuple $t_1$, a businessman invests USD \$100 in product A, and it has a 0.29 chance of obtaining a profit of USD \$25.

| Tuple | Product ID | Profit of USD \$100 investment | Probability |
|-------|------------|--------------------------------|-------------|
| $t_1$ | A | 25 | 0.29 |
| $t_2$ | B | 18 | 0.3 |
| $t_3$ | E |  | 0.8 |
| $t_4$ | B | 13 | 0.4 |
| $t_5$ | C |  | 1.0 |
| $t_6$ | E |  | 0.2 |

Table 1. Predicted Profits of USD \$100 investment on Products

In the real world, when analysing historical data, predictions of future market trends return two or more values per product, each with a probability that the prediction is correct. Therefore, some tuples in Table 1 have the same product ID with different profits. In the probabilistic data model, these tuples are mutually exclusive and are controlled by a set of rules (the generation rule) [1, 2, 6, 13]. For example, tuples $t_2$ and $t_4$ are projects that invest in product B, with a 0.3 probability of producing a USD \$18 profit and a 0.4 probability of producing a USD \$13 profit. In this case, if the prediction for tuple $t_2$ is true, then the prediction for tuple $t_4$ cannot be true. It is impossible for both profits to be true for the same product ID; they are mutually exclusive predictions. In Table 1, the probabilistic data are restricted by the exclusive rules $R_1 = t_2 \oplus t_4$ and $R_2 = t_3 \oplus t_6$.

Top-k queries can help investors make business decisions such as choosing the projects with the top-2 highest profits. On probabilistic databases, top-k queries can be answered by using the probability space that enumerates all possible worlds [1, 2, 7, 14, 15, 16]. A possible world contains a number of tuples in the probabilistic data set.
Each possible world has a non-zero probability of existence and can contain k tuples with the highest profits. Different possible worlds can contain different sets of k tuple answers. Therefore, it is necessary to list all possible worlds of Table 1 to find the answers to the top-2 query on the probabilistic database. Table 2 lists three dimensions: the possible world, its probability of existence, and the top-2 tuples in that possible world.

| Possible world | Probability of existence | Top-2 tuples in possible world |
|----------------|--------------------------|--------------------------------|
| $W_1 = \{t_1, t_2, t_3, t_5\}$ | 0.0696 | $t_1, t_2$ |
| $W_2 = \{t_1, t_2, t_5, t_6\}$ | 0.0174 | $t_1, t_2$ |
| $W_3 = \{t_1, t_3, t_4, t_5\}$ | 0.0928 | $t_1, t_3$ |
| $W_4 = \{t_1, t_4, t_5, t_6\}$ | 0.0232 | $t_1, t_4$ |
| $W_5 = \{t_1, t_3, t_5\}$ | 0.0696 | $t_1, t_3$ |
| $W_6 = \{t_1, t_5, t_6\}$ | 0.0174 | $t_1, t_5$ |
| $W_7 = \{t_2, t_3, t_5\}$ | 0.1704 | $t_2, t_3$ |
| $W_8 = \{t_2, t_5, t_6\}$ | 0.0426 | $t_2, t_5$ |
| $W_9 = \{t_3, t_4, t_5\}$ | 0.2272 | $t_3, t_4$ |
| $W_{10} = \{t_4, t_5, t_6\}$ | 0.0568 | $t_4, t_5$ |
| $W_{11} = \{t_3, t_5\}$ | 0.1704 | $t_3, t_5$ |
| $W_{12} = \{t_5, t_6\}$ | 0.0426 | $t_5, t_6$ |

Table 2. List of all possible worlds and top-2 tuples

According to Table 2, any tuple has a probability of being in the top-2. Therefore, Table 3 lists the tuples, profit, probability, and top-2 probability to analyse the top-2 answers of probabilistic databases. The top-2 probability of a tuple is the sum of the probabilities of existence of the possible worlds in Table 2 in which the tuple is in the top-2.

In previous research [1, 2], the top-k answers are found using the probability threshold approach called PT-k. PT-k queries return the set of tuples whose top-k probabilities are greater than the user's threshold value. For example the

#### 2.1. Probabilistic database model

Generally, the probabilistic database $D = \{t_1, \dots, t_n\}$ is assumed to be a single probabilistic table, which is a finite set of probabilistic tuples. Each probabilistic tuple is a tuple associated with a probability that denotes the uncertainty of the data, and almost all probabilistic tuples in probabilistic databases are independent.

Definition 1. The probability of a tuple is the likelihood of the tuple appearing in the data set $D$. The probability $p(t_i)$ of tuple $t_i$ can be presented as a new dimension in the table with values $0 < p(t_i) \le 1$.

Example: In Table 1, tuple $t_1$ has a 0.29 probability of obtaining a USD \$25 profit.

Based on probability theory, if the probability of tuple $t_i$ is $p(t_i)$, then $1 - p(t_i)$ is the probability that $t_i$ is absent from the data.

A simple way to study probabilistic data is to list sets of tuples in which each tuple is present or absent with its specific probability; these sets are called possible worlds. It is interesting to view all possible worlds and study various solutions on probabilistic data.

Definition 2. A possible world represents the semantics of a probabilistic database. Each possible world $W_j = \{t_1, \dots, t_k\}$ contains a number of tuples which are members of probabilistic database $D$. Each possible world is associated with a probability that indicates its existence. The probability of existence of possible world $W_j$ is calculated by multiplying the probabilities of presence of the tuples in $W_j$ and the probabilities of absence of the tuples not in $W_j$. The set of all possible worlds $W = \{W_1, \dots, W_m\}$ is called the possible world space.

Example: Table 2 is the possible world space that lists all possible worlds of Table 1.

In realistic situations, probabilistic tuples in probabilistic data can be independent of or dependent on each other.
Several probabilistic tuples can be present together or restrict each other's presence in the possible worlds. In the probabilistic database model, they are grouped as a generation rule.

Definition 3. In probabilistic data, a generation rule is a set of exclusive or inclusive rules. Each rule contains special tuples, which can be mutually exclusive or inclusive in the possible world space.

- The exclusive rule is in the form $R^+_h = t_{h_1} \oplus t_{h_2} \oplus \dots \oplus t_{h_q}$, where $\oplus$ is an exclusive operator and $t_{h_1}, t_{h_2}, \dots, t_{h_q}$ are members of the probabilistic data. It indicates that, at most, one of these tuples can appear in the same possible world. The sum of all probabilities of tuples in the same exclusive rule must be less than or equal to 1: $p(R^+_h) = \sum_{i=1}^{q} p(t_{h_i}) \le 1$.
- The inclusive rule is in the form $R_h = t_{h_1} \equiv t_{h_2} \equiv \dots \equiv t_{h_q}$, where $\equiv$ is an inclusive operator and $t_{h_1}, t_{h_2}, \dots, t_{h_q}$ are members of the probabilistic data. It indicates that either none or all of these tuples appear in the same possible world. Tuples in $R_h$ have the same probability $p(R_h) = p(t_{h_1}) = p(t_{h_2}) = \dots = p(t_{h_q})$, with $0 < p(R_h) \le 1$.

Example: In Table 1, the generation rule contains two exclusive rules and no inclusive rule. The exclusive rules are $R^+_1 = t_2 \oplus t_4$ and $R^+_2 = t_3 \oplus t_6$. $R^+_1$ indicates that tuples $t_2$ and $t_4$ never appear together in the same possible world. The same holds for tuples $t_3$ and $t_6$ in $R^+_2$.

#### 2.2. Calculation of top-k probability

In this subsection, we discuss and formulate the calculation of top-k probabilities of tuples in the possible world space.

Definition 4. The top-k probability of a tuple, $pr^k_{top}(t_i)$, is the likelihood of tuple $t_i$ appearing in the top-k in all possible worlds. Mathematically, $pr^k_{top}(t_i)$ is the sum of the probabilities of existence of the possible worlds in which the tuple is in the top-k.
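Definitions 2 to 4 can be checked by brute force on a small data set. The following is a minimal Python sketch, not the authors' implementation: it enumerates the possible worlds of the six tuples of Table 1 under the two exclusive rules and sums world probabilities per tuple. The names `TUPLES`, `EXCLUSIVE_RULES`, `possible_worlds` and `topk_probabilities` are hypothetical, the tuple probabilities are the ones quoted in the text, and only exclusive rules are handled.

```python
from itertools import product

# Table 1 in descending score order: (tuple name, probability).
# Illustrative encoding; only the probabilities are needed for ranking.
TUPLES = [("t1", 0.29), ("t2", 0.3), ("t3", 0.8),
          ("t4", 0.4), ("t5", 1.0), ("t6", 0.2)]
EXCLUSIVE_RULES = [("t2", "t4"), ("t3", "t6")]


def possible_worlds(tuples, rules):
    """Yield (world, probability) for every world with non-zero probability."""
    prob = dict(tuples)
    in_rule = {name for rule in rules for name in rule}
    groups = []
    # An independent tuple is either present or absent.
    for name, p in tuples:
        if name not in in_rule:
            groups.append([((name,), p), ((), 1.0 - p)])
    # An exclusive rule contributes exactly one member, or none of them.
    for rule in rules:
        options = [((name,), prob[name]) for name in rule]
        options.append(((), 1.0 - sum(prob[name] for name in rule)))
        groups.append(options)
    for choice in product(*groups):
        world, p_world = set(), 1.0
        for members, p in choice:
            world.update(members)
            p_world *= p
        if p_world > 0.0:
            yield world, p_world


def topk_probabilities(tuples, rules, k):
    """Definition 4: sum world probabilities where the tuple is in the top-k."""
    order = [name for name, _ in tuples]
    result = {name: 0.0 for name in order}
    for world, p_world in possible_worlds(tuples, rules):
        ranked = [name for name in order if name in world]
        for name in ranked[:k]:
            result[name] += p_world
    return result


worlds = list(possible_worlds(TUPLES, EXCLUSIVE_RULES))
top2 = topk_probabilities(TUPLES, EXCLUSIVE_RULES, k=2)
print(len(worlds))
print({name: round(p, 4) for name, p in top2.items()})
```

For Table 1 this enumeration produces 12 possible worlds and reproduces the values quoted in the text, 0.3 for $t_2$ and 0.7304 for $t_3$. The cost grows exponentially with the number of tuples, which is why closed formulas are needed.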

Example: In Table 2, since $t_2$ is one of the top-2 tuples in possible worlds $W_1$, $W_2$, $W_7$ and $W_8$, the top-2 probability of tuple $t_2$ is the sum of the probabilities of existence of these possible worlds. That is, $pr^2_{top}(t_2) = 0.0696 + 0.0174 + 0.1704 + 0.0426 = 0.3$.

As the number of tuples in the probabilistic data grows, it becomes infeasible to list all possible worlds and calculate every probability of existence in limited time, because the number of possible worlds can reach $2^n$ and computing all their probabilities is very expensive. It is therefore impractical to calculate the top-k probability by listing the possible world space, and formulas to calculate the top-k probability of tuples are required. Formulas for this problem are already known in mathematics, so we adapt them to this research.

Let probabilistic data $D$ be a ranked sequence $(t_1, t_2, \dots, t_n)$ with $t_1.score \ge t_2.score \ge \dots \ge t_n.score$, and let $S_{t_i} = (t_1, t_2, \dots, t_i)$ ($1 \le i \le n$) be the subsequence from $t_1$ to $t_i$. In the possible world space, the top-k probability of tuple $t_i$ is the sum of the probabilities of $t_i$ being ranked at positions 1 to k. When tuple $t_i$ is ranked at the $(j+1)$-th position, exactly $j$ tuples of the higher-ranked subsequence $S_{t_{i-1}} = (t_1, t_2, \dots, t_{i-1})$ also appear in the possible world.

Theorem 1. Given a ranked sequence $S_{t_i} = (t_1, t_2, \dots, t_i)$, $pr(S_{t_i}, j)$ is the probability of any $j$ tuples ($0 < j \le i < n$) appearing in the sequence $S_{t_i}$. $pr(S_{t_i}, j)$ is calculated as follows:

$$pr(S_{t_i}, j) = pr(S_{t_{i-1}}, j-1)\, p(t_i) + pr(S_{t_{i-1}}, j)\,(1 - p(t_i))$$

In special cases:

$$pr(\phi, 0) = 1, \qquad pr(\phi, j) = 0 \ (j > 0), \qquad pr(S_{t_{i-1}}, 0) = \prod_{j=1}^{i-1} (1 - p(t_j))$$

This Poisson binomial recurrence has been proven in [24].
Example: Given the data set in Table 1, $pr(S_{t_1}, 1)$, the probability of any one tuple appearing in the sequence $S_{t_1} = \{t_1\}$, and $pr(S_{t_2}, 1)$, the probability of any one tuple appearing in the sequence $S_{t_2} = \{t_1, t_2\}$, are computed as follows:

$$pr(S_{t_1}, 1) = pr(\phi, 0)\, p(t_1) + pr(\phi, 1)\,(1 - p(t_1)) = 1 \times 0.29 + 0 \times (1 - 0.29) = 0.29$$

$$pr(S_{t_2}, 1) = pr(S_{t_1}, 0)\, p(t_2) + pr(S_{t_1}, 1)\,(1 - p(t_2)) = (1 - 0.29) \times 0.3 + 0.29 \times (1 - 0.3) = 0.416$$

Based on Theorem 1, the probability of a tuple being ranked at an exact position is given by the following theorem.

Theorem 2. Given a tuple $t_i$ and the subsequence $S_{t_{i-1}}$, $pr(t_i, j)$ is the probability of tuple $t_i$ being ranked at exactly the $j$-th position ($n > i \ge j > 0$). $pr(t_i, j)$ is calculated as follows:

$$pr(t_i, j) = p(t_i)\, pr(S_{t_{i-1}}, j-1)$$

This Poisson binomial recurrence has been proven in [24].

Example: Given the data set in Table 1, the 1st and 2nd rank probabilities of tuple $t_3$ are computed as follows:

$$pr(t_3, 1) = p(t_3)\, pr(S_{t_2}, 0) = 0.8 \times (1 - 0.29) \times (1 - 0.3) = 0.3976$$

$$pr(t_3, 2) = p(t_3)\, pr(S_{t_2}, 1) = 0.8 \times 0.416 = 0.3328$$

To calculate the top-k probability of tuple $t_i$, Theorem 3 extends Theorem 2 by summing the probabilities of $t_i$ being ranked at positions 1 to k.
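The recurrence of Theorem 1 and the rank probability of Theorem 2 can be sketched as a small dynamic program. This is an illustrative Python sketch under the assumption that all tuples are independent (no generation rules); the function names `count_distribution` and `rank_probability` are hypothetical and do not come from the paper.

```python
def count_distribution(probs, max_j):
    """Return [pr(S, 0), ..., pr(S, max_j)] for independent tuples.

    pr(S, j) is the probability that exactly j tuples of S appear,
    computed with the recurrence of Theorem 1.
    """
    dist = [1.0] + [0.0] * max_j          # pr(phi, 0) = 1 and pr(phi, j) = 0
    for p in probs:
        # Go from high j to low j so dist[j - 1] still holds the old value.
        for j in range(max_j, 0, -1):
            dist[j] = dist[j - 1] * p + dist[j] * (1.0 - p)
        dist[0] *= 1.0 - p
    return dist


def rank_probability(probs, i, j):
    """Theorem 2: probability that tuple t_i (1-based) is ranked exactly j-th."""
    higher = probs[: i - 1]               # the subsequence S_{t_{i-1}}
    return probs[i - 1] * count_distribution(higher, j - 1)[j - 1]


# Probabilities of t1, t2, t3 from Table 1, in ranking order.
P = [0.29, 0.3, 0.8]
print(count_distribution(P[:2], 1))                       # pr(S_t2, 0), pr(S_t2, 1)
print(rank_probability(P, 3, 1), rank_probability(P, 3, 2))
```

Summing the two rank probabilities of $t_3$ gives its top-2 probability. Truncating the table at `max_j` is safe because each entry depends only on entries with the same or a smaller index.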

Theorem 3. Given a tuple $t_i$, $pr^k_{top}(t_i)$ is the top-k probability of $t_i$ in the possible world space. $pr^k_{top}(t_i)$ is calculated as follows:

$$pr^k_{top}(t_i) = \sum_{j=1}^{k} pr(t_i, j) = p(t_i) \sum_{j=1}^{k} pr(S_{t_{i-1}}, j-1)$$

If $i \le k$ then $pr^k_{top}(t_i) = p(t_i)$.

Example: Given the data set in Table 1, the top-2 probability of tuple $t_3$ is computed as follows:

$$pr^2_{top}(t_3) = \sum_{j=1}^{2} pr(t_3, j) = pr(t_3, 1) + pr(t_3, 2) = 0.3976 + 0.3328 = 0.7304$$

In general, the above three theorems are used to calculate the top-k probability of all tuples directly, without listing any possible worlds.

#### 2.3. Calculation of top-k probability with a generation rule

Generally, we can use the above theorems to calculate the top-k probability of each tuple. However, the probabilistic data model involves mutually exclusive and inclusive rules, so the calculations of top-k probabilities have to take these rules into account. Let a probabilistic database $D = (t_1, t_2, \dots, t_n)$ with $t_1.score \ge t_2.score \ge \dots \ge t_n.score$ be ranked as a sequence. We follow [1, 2] in processing the exclusive and inclusive rules in the formulas that calculate the top-k probability of tuple $t_i$.

##### Exclusive rules

Tuples in the same rule $R^+_h = t_{h_1} \oplus \dots \oplus t_{h_m}$ are mutually exclusive, which indicates that, at most, one of them can appear in the same possible world. Therefore, an exclusive rule can be produced as a single tuple, and the previous formulas are then modified to calculate the top-k probability.

To compute the top-k probability $pr^k_{top}(t_i)$ of tuple $t_i \in D$ ($1 \le i \le n$), $t_i$ divides the generation rule $R^+_h = t_{h_1} \oplus \dots \oplus t_{h_m}$ into two parts, $R^+_{hleft} = t_{h_1} \oplus \dots \oplus t_{h_j}$ and $R^+_{hright} = t_{h_{j+1}} \oplus \dots \oplus t_{h_m}$. The tuples involved in $R^+_{hleft}$ are ranked higher than or equal to tuple $t_i$. The tuples involved in $R^+_{hright}$ are ranked lower than tuple $t_i$.
According to this division, the following cases show how the exclusive rule is produced as a tuple in the formulas that calculate the top-k probability of tuple $t_i$.

- Case 1: $R^+_{hleft} = \phi$, i.e. all the tuples in rule $R^+_h$ are ranked lower than tuple $t_i$. These tuples play no part in the top-k probability of $t_i$, so all tuples in $R^+_h$ are ignored.
- Case 2: $R^+_{hleft} \ne \phi$, i.e. all tuples in $R^+_{hright}$ can be ignored, and the tuples in $R^+_{hleft}$ are changed when calculating the top-k probability of $t_i$. There are two sub-cases for these changes.
  + sub-case 1: $t_i \in R^+_{hleft}$, i.e. $t_i$ has already appeared in the possible world space. The other tuples in $R^+_{hleft}$ cannot appear, so these tuples are removed from subsequence $S_{t_{i-1}}$ when calculating $pr^k_{top}(t_i)$.
  + sub-case 2: $t_i \notin R^+_{hleft}$, i.e. all tuples in $R^+_{hleft}$ are produced and considered as one tuple $t_{R^+_{hleft}}$ whose probability is the sum of their probabilities, $p(t_{R^+_{hleft}}) = \sum_{t \in R^+_{hleft}} p(t)$.

After all the exclusive rules are produced, the formulas for calculating the top-k probability in the previously mentioned theorems are used as normal.

Example: In Table 1, the top-2 probability of tuple $t_6$, $pr^2_{top}(t_6)$, is calculated by applying the two exclusive rules $R^+_1 = t_2 \oplus t_4$ and $R^+_2 = t_3 \oplus t_6$. The subsequence $S_{t_6}$ contains the tuples $\{t_1, t_2, t_3, t_4, t_5, t_6\}$ with probabilities $\{0.29, 0.3, 0.8, 0.4, 1.0, 0.2\}$, respectively. The exclusive rules are produced in subsequence $S_{t_6}$.

- In exclusive rule $R^+_2$, $t_3$ is removed because $t_3$ and $t_6$ are in the same exclusive rule (sub-case 1 of the exclusive rule).

- In exclusive rule $R^+_1$, $t_2 \oplus t_4$ are produced as $t_{2 \oplus 4}$. The probability of $t_{2 \oplus 4}$ is 0.7 (sub-case 2 of the exclusive rule).

After the exclusive rules are produced, the subsequence $S_{t_6}$ is $\{t_1, t_{2 \oplus 4}, t_5, t_6\}$ with probabilities $\{0.29, 0.7, 1.0, 0.2\}$. $pr^2_{top}(t_6)$ with the set $S_{t_5} = \{t_1, t_{2 \oplus 4}, t_5\}$ is calculated as follows:

$$pr^2_{top}(t_6) = p(t_6) \sum_{j=1}^{2} pr(S_{t_5}, j-1) = p(t_6)\,\big(pr(S_{t_5}, 0) + pr(S_{t_5}, 1)\big)$$

where

$$pr(S_{t_5}, 0) = (1 - 0.29)(1 - 0.7)(1 - 1) = 0$$

$$pr(S_{t_5}, 1) = (1 - 0.29)(1 - 0.7) \times 1 = 0.213$$

$$pr^2_{top}(t_6) = 0.2 \times (0 + 0.213) = 0.0426$$

##### Inclusive rules

Tuples in the same rule $R_h = t_{h_1} \equiv t_{h_2} \equiv \dots \equiv t_{h_q}$ are mutually inclusive, which indicates that either none or all of them appear in a possible world. To compute the top-k probability $pr^k_{top}(t_i)$ of tuple $t_i \in D$ ($1 \le i \le n$), the following cases distinguish whether or not $t_i$ is in an inclusive rule.

- Case 1: All the tuples in rule $R_h$ are ranked lower than tuple $t_i$. All tuples in $R_h$ are ignored.
- Case 2: At least one tuple in inclusive rule $R_h$ is ranked higher than or equal to tuple $t_i$. There are two sub-cases.
  + sub-case 1: $t_i \in R_h$, i.e. $t_i$ has already appeared in the possible world space, and there are $z$ tuples ranked higher than $t_i$ in $R_h$ ($z < k$), which also appear with $t_i$ in the same possible worlds. Therefore, the top-k probability $pr^k_{top}(t_i)$ of Theorem 3 is modified as follows: $pr^k_{top}(t_i) = p(t_i) \sum_{j=1}^{k-z} pr(S_{t_{i-1}} \setminus R_h, j-1)$.
  + sub-case 2: $t_i \notin R_h$, i.e. there are $z$ tuples ranked higher than $t_i$ in $R_h$ ($z < k$), and these tuples are considered as one tuple $t_{R_h}$ with probability $p(t_{R_h}) = p(t_{h_1}) = p(t_{h_2}) = \dots = p(t_{h_q})$.
The sequence $S_{t_i}$ is now $S'_{t_i} \cup \{t_{R_h}\}$, where $S'_{t_i} = \{t_1, t_2, \dots, t_i\} \setminus \{t \mid t \in R_h\}$. The probability of any $j$ tuples ($0 < j \le i < n$) appearing in the set $S_{t_i}$ is then modified as follows:

$$pr(S_{t_i}, 0) = pr(S'_{t_i}, 0)\,(1 - p(t_{R_h}))$$

$$pr(S_{t_i}, j) = pr(S'_{t_i}, j)\,(1 - p(t_{R_h})) \quad (0 < j < z)$$

$$pr(S_{t_i}, j) = pr(S'_{t_i}, j - z)\, p(t_{R_h}) + pr(S'_{t_i}, j)\,(1 - p(t_{R_h})) \quad (j \ge z)$$

Example: Assume that two inclusive rules $R_1 = t_3 \equiv t_5$ and $R_2 = t_4 \equiv t_6$, with probabilities $p(R_1) = p(t_3) = p(t_5) = 0.7$ and $p(R_2) = p(t_4) = p(t_6) = 0.4$, are in probabilistic database $D = (t_1, t_2, \dots, t_{10})$ in ranking order. To compute $pr^4_{top}(t_6)$, the inclusive rules are produced in subsequence $S_{t_5}$.

- In inclusive rule $R_2 = t_4 \equiv t_6$, $t_6 \in R_2$, so sub-case 1 of the inclusive rule is applied. The subsequence $S_{t_5} \setminus R_2$ is $\{t_1, t_2, t_3, t_5\}$, and $z = 1$ because $t_4$ has a higher rank than $t_6$. The top-4 probability of tuple $t_6$ is $pr^4_{top}(t_6) = p(t_6) \sum_{j=1}^{3} pr(\{t_1, t_2, t_3, t_5\}, j-1)$.
- In inclusive rule $R_1 = t_3 \equiv t_5$, tuples $t_3$ and $t_5$ are ranked higher than tuple $t_6$. We produce inclusive rule $R_1$ as $t_{3,5}$ with probability $p(t_{3,5}) = p(R_1) = 0.7$. The set $S_{t_5}$ is $\{t_1, t_2, t_{3,5}\}$, and the set $S'_{t_5} = S_{t_5} \setminus \{t_{3,5}\}$ is $\{t_1, t_2\}$, i.e. $S'_{t_5} = S_{t_2}$.

Then, sub-case 2 of the inclusive rule is applied to calculate the top-4 probability of tuple $t_6$.

$$pr^4_{top}(t_6) = p(t_6) \sum_{j=1}^{3} pr(\{t_1, t_2, t_{3,5}\}, j-1)$$

$$pr^4_{top}(t_6) = p(t_6)\,\big(pr(\{t_1, t_2, t_{3,5}\}, 0) + pr(\{t_1, t_2, t_{3,5}\}, 1) + pr(\{t_1, t_2, t_{3,5}\}, 2)\big)$$

$$pr(\{t_1, t_2, t_{3,5}\}, 0) = pr(S_{t_2}, 0)\,(1 - p(t_{3,5}))$$

$$pr(\{t_1, t_2, t_{3,5}\}, 1) = pr(S_{t_2}, 1)\,(1 - p(t_{3,5}))$$

$$pr(\{t_1, t_2, t_{3,5}\}, 2) = pr(S_{t_2}, 0)\, p(t_{3,5}) + pr(S_{t_2}, 2)\,(1 - p(t_{3,5}))$$

### 3. The top-k best probability queries

This section presents the proposed method for finding and computing the answer to the top-k best probability query. First, we create a new definition of the top-k best probability query, based on the rationale described in the introduction. Secondly, we present the significance of our proposed method. Then, we introduce the process for selecting an answer to the top-k best probability queries, and several pruning rules for an effective algorithm. Lastly, we describe the algorithm for computing the top-k best probability query.

#### 3.1. Definition of the top-k best probability

Tuples which are the result of probabilistic top-k queries must be selected by considering not only the ranking score but also the top-k probability [25, 21]. In the area of probabilistic data, significant research is being conducted on the semantics of top-k queries. However, the relationship between high-scoring tuples and tuples with high top-k probabilities is interpreted differently by various researchers. Our ranking approach treats the ranking score and the top-k probability as two independent dimensions, in which the ranking score is not considered more important than the top-k probability, or vice versa.

To answer a probabilistic query, every tuple has two associated dimensions: the top-k probability and its ranking score. These two dimensions are crucial for choosing the answer to the top-k query on probabilistic data. They are also independent in real-world applications.
In this research, we introduce the concept of dominating tuples to capture the full meaning of top-k tuples, which are the non-dominated tuples. This concept is widely used for multiple independent dimensions in skyline queries [26, 22, 27, 28, 29, 30]. In this paper, the domination concept for top-k queries is used to compare top-k tuples in the two dimensions of score and top-k probability.

Definition 5. (domination of top-k probability tuples) For any tuples $t_i$ and $t_j$ in probabilistic data, $t_i$ dominates $t_j$ ($t_i \prec t_j$) if and only if $t_i$ is ranked higher than $t_j$ and the top-k probability of $t_i$ is greater than the top-k probability of $t_j$ ($pr^k_{top}(t_i) > pr^k_{top}(t_j)$); otherwise $t_i$ does not dominate $t_j$ ($t_i \nprec t_j$).

Example: In Table 3, tuple $t_3$ dominates $t_5$, because $t_3$ has a higher rank than $t_5$ ($rank(t_3.score) = 3$, $rank(t_5.score) = 5$) and the top-2 probability (0.7304) of $t_3$ is greater than the top-2 probability (0.3072) of $t_5$.

Definition 6. (non-dominated tuple) A tuple which is not dominated by any other tuple on score and top-k probability is a non-dominated tuple.

Example: In Table 3, tuples $t_1$ and $t_3$ are the non-dominated tuples in the set $\{t_1, t_3, t_4, t_5, t_6\}$.

In previous research, the answer to PT-k is based on the top-k probabilities and the threshold. The resultant set, which is analysed in the introduction, may exclude some of the highest ranking score tuples or include some redundant tuples. Therefore, we have to create new top-k queries which overcome these problems. Definition 7 is introduced to select the best tuples on score and top-k probability, using the domination concept to improve the quality of the top-k answers. That is, we are looking for tuples which are in both the top-k best ranking scores and the best top-k probabilities.
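Definitions 5 and 6 translate into a short check. The following Python sketch is illustrative only: the function names are hypothetical, each tuple is encoded as a `(rank, top-k probability)` pair, and the sample uses the values quoted above for $t_3$ and $t_5$ together with $pr^2_{top}(t_1) = p(t_1) = 0.29$, which follows from the special case $i \le k$ of Theorem 3.

```python
def dominates(a, b):
    """Definition 5: a and b are (rank, top-k probability); rank 1 is the best score."""
    rank_a, pr_a = a
    rank_b, pr_b = b
    return rank_a < rank_b and pr_a > pr_b


def non_dominated(tuples):
    """Definition 6: names of the tuples that no other tuple dominates."""
    return [
        name
        for name, value in tuples.items()
        if not any(
            dominates(other, value)
            for other_name, other in tuples.items()
            if other_name != name
        )
    ]


# (rank, top-2 probability) for three of the tuples discussed in the text.
top2 = {"t1": (1, 0.29), "t3": (3, 0.7304), "t5": (5, 0.3072)}
print(dominates(top2["t3"], top2["t5"]))
print(non_dominated(top2))
```

The pairwise check is quadratic in the number of tuples. When the tuples are already sorted by score, a single pass that tracks the largest top-k probability seen so far finds the same non-dominated tuples.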

##### a. Selecting the answer to the top-k best probability queries

Suppose that the probabilistic data set has been ranked by score. We divide the $n$ probabilistic tuples of data set $D$ into two sets, $Q_{score} = \{t_1, t_2, \dots, t_k\}$ and $D \setminus Q_{score} = \{t_{k+1}, t_{k+2}, \dots, t_n\}$. $Q_{score}$ contains the $k$ tuples with the highest ranking scores. To select all non-dominated tuples for $Q_{pro}$, we first pick the tuple $t_{pmin}$ which has the lowest top-k probability in $Q_{score}$. This tuple is always the first non-dominated tuple in $Q_{pro} \cup \{t_{pmin}\}$, because $t_{pmin}$ has a higher rank than all tuples in $D \setminus Q_{score}$, which means that no tuple in $D \setminus Q_{score}$ dominates $t_{pmin}$.

As the tuples are already in ranking order, the rest of the non-dominated tuples are selected by considering only their top-k probabilities. The initial value of bestpr is $bestpr = \min_{i=1}^{k} \big(pr^k_{top}(t_i)\big)$. To select the non-dominated tuples from $\{t_{k+1}, t_{k+2}, \dots, t_n\}$, the top-k probability of each tuple from $t_{k+1}$ to $t_n$ is calculated in succession and compared with bestpr. If it is greater than bestpr, the tuple is inserted into $Q_{pro}$, and bestpr is assigned this greater top-k probability. The inserted tuple is non-dominated in $Q_{pro}$ because its top-k probability is greater than the top-k probabilities of all tuples already in $Q_{pro}$. When all tuples have been processed, all the non-dominated tuples of $Q_{pro}$ have been found. The answer set $Q^k_{top} = Q_{score} \cup Q_{pro}$ is returned for the top-k best probability query.

The value of bestpr increases while selecting the non-dominated tuples, which have ever higher top-k probabilities in the answer $Q_{pro}$; therefore bestpr is called the best top-k probability.
The bestpr is also the key value for improving the effectiveness and efficiency of our proposed algorithm, because the pruning rules use it to eliminate tuples with low top-k probabilities without calculating them.

##### b. Pruning rules

In this subsection, we introduce and prove several theorems that minimize the calculation of the top-k probability of tuples and activate the stopping condition, which removes the need to calculate the remaining tuples. The first pruning rule is presented as Theorem 4, which uses the bestpr value to eliminate the current tuple without calculating its top-k probability.

Theorem 4. Given any tuple $t_i$ and bestpr, if $p(t_i) \le bestpr$ then $pr^k_{top}(t_i) \le bestpr$.

Proof. The top-k probability of $t_i$ is calculated by Theorem 3:

$$pr^k_{top}(t_i) = p(t_i) \sum_{j=1}^{k} pr(S_{t_{i-1}}, j-1)$$

With $p(t_i) \le bestpr$ and $\sum_{j=1}^{k} pr(S_{t_{i-1}}, j-1) \le 1$, we have

$$pr^k_{top}(t_i) = p(t_i) \sum_{j=1}^{k} pr(S_{t_{i-1}}, j-1) \le bestpr$$

Theorem 5 is a pruning rule which uses the probabilities and top-k probabilities of previous tuples to eliminate the current tuple without performing the top-k probability calculation.

Theorem 5. Given any $t_{i-m}$, $t_i$ and bestpr, if $p(t_{i-m}) \ge p(t_i)$ and $pr^k_{top}(t_{i-m}) \le bestpr$ then $pr^k_{top}(t_i) \le bestpr$, for any $1 \le m < i$.

Proof. The top-k probability of $t_{i-m}$ is calculated by Theorem 3:

$$pr^k_{top}(t_{i-m}) = p(t_{i-m}) \sum_{j=1}^{k} pr(S_{t_{i-m-1}}, j-1)$$

For any $1 \le m < i$, we have the sequences $S_{t_{i-m}} \subseteq S_{t_i}$; this is also true when $S$ contains generation rules. Therefore

$$\sum_{j=1}^{k} pr(S_{t_{i-m-1}}, j-1) \ge \sum_{j=1}^{k} pr(S_{t_{i-1}}, j-1)$$

With $pr^k_{top}(t_{i-m}) \le bestpr$ and $p(t_{i-m}) \ge p(t_i)$, we have

$$pr^k_{top}(t_i) = p(t_i) \sum_{j=1}^{k} pr(S_{t_{i-1}}, j-1) \le pr^k_{top}(t_{i-m}) \le bestpr$$

We note that, when Theorem 5 is used in our algorithm, only the condition $p(t_{i-m}) \ge p(t_i)$ must be checked to eliminate the current tuple, because the other condition $pr^k_{top}(t_{i-m}) \le bestpr$ is always true. The reason is that bestpr is assigned the best top-k probability value among tuples $t_{k+1}$ to $t_{i-1}$ during algorithm processing.

The following Theorem 6 is the most important rule in our algorithm. It is used to stop the algorithm and return the answer early, before the top-k probabilities of the remaining lower ranking score tuples are calculated.

Theorem 6. For any $t_i$ and bestpr, if $bestpr \ge \sum_{j=1}^{k} pr(S_{t_i} \setminus \{t_{Lmax}\}, j-1)$ then $bestpr \ge pr^k_{top}(t_{i+m})$ for any $m \ge 1$, where

- $S_{t_i} = (t_1, t_2, \dots, t_i)$ is the ranked sequence.
- $t_{Lmax}$ is the produced tuple that has the highest produced probability among the produced tuples of the generation rules.

Proof. The top-k probability of $t_{i+m}$ is calculated by Theorem 3:

$$pr^k_{top}(t_{i+m}) = p(t_{i+m}) \sum_{j=1}^{k} pr(S_{t_{i+m-1}}, j-1)$$

In the worst case, $t_{i+m}$ is in an exclusive rule $R_h$:

$$pr^k_{top}(t_{i+m}) \le p(t_{i+m}) \sum_{j=1}^{k} pr(S_{t_{i+m-1}} \setminus t_{R_hLeft}, j-1)$$

For any $m \ge 1$, we have the sequences $S_{t_i} \subseteq S_{t_{i+m-1}}$. This is also true when calculating $pr^k_{top}(t_i)$ with $S_{t_i}$ containing an inclusive rule.

$$pr^k_{top}(t_{i+m}) \le p(t_{i+m}) \sum_{j=1}^{k} pr(S_{t_i} \setminus t_{R_hLeft}, j-1)$$

In exclusive rules, we have $p(t_{R_hLeft}) \le p(t_{Lmax})$ for any produced tuple:

$$pr^k_{top}(t_{i+m}) \le p(t_{i+m}) \sum_{j=1}^{k} pr(S_{t_i} \setminus t_{Lmax}, j-1)$$

For the probability of any tuple, we have $0 < p(t_{i+m}) \le 1$:

$$pr^k_{top}(t_{i+m}) \le \sum_{j=1}^{k} pr(S_{t_i} \setminus t_{Lmax}, j-1)$$

Hence, if $bestpr \ge \sum_{j=1}^{k} pr(S_{t_i} \setminus t_{Lmax}, j-1)$, then $bestpr \ge pr^k_{top}(t_{i+m})$.
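The selection procedure and the first two pruning rules can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: only exclusive rules are supported, the stopping condition of Theorem 6 is omitted, `p_prev` is updated only when a tuple enters the answer, and all function names and the index-based encoding of rules are hypothetical.

```python
def count_distribution(probs, max_j):
    """[pr(S, 0), ..., pr(S, max_j)] via the recurrence of Theorem 1."""
    dist = [1.0] + [0.0] * max_j
    for p in probs:
        for j in range(max_j, 0, -1):
            dist[j] = dist[j - 1] * p + dist[j] * (1.0 - p)
        dist[0] *= 1.0 - p
    return dist


def topk_probability(probs, rules, i, k):
    """Top-k probability of the tuple at 0-based rank i under exclusive rules.

    probs: tuple probabilities in descending score order.
    rules: list of sets of 0-based indices that are mutually exclusive.
    """
    if i < k:
        return probs[i]                      # special case of Theorem 3
    higher = set(range(i))                   # indices of S_{t_{i-1}}
    compressed = []
    for rule in rules:
        left = {t for t in rule if t < i}    # members ranked above t_i
        higher -= left
        if left and i not in rule:
            # sub-case 2: merge the higher-ranked members into one tuple
            compressed.append(sum(probs[t] for t in left))
        # sub-case 1 (t_i in the rule): the other members are dropped
    compressed.extend(probs[t] for t in higher)
    return probs[i] * sum(count_distribution(compressed, k - 1))


def topk_best_probability(probs, rules, k):
    """Return ({index: top-k probability} for the answer, number of evaluations)."""
    answer = {i: topk_probability(probs, rules, i, k)
              for i in range(min(k, len(probs)))}
    best_pr = min(answer.values())
    p_prev = best_pr
    evaluated = 0
    for i in range(k, len(probs)):
        # Theorem 4 and Theorem 5: skip without computing the top-k probability.
        if probs[i] <= best_pr or probs[i] <= p_prev:
            continue
        evaluated += 1
        pr = topk_probability(probs, rules, i, k)
        if pr > best_pr:                     # non-dominated tuple found
            best_pr = pr
            answer[i] = pr
            p_prev = probs[i]
    return answer, evaluated


# Table 1 probabilities (t1..t6) with rules t2/t4 and t3/t6 as 0-based indices.
PROBS = [0.29, 0.3, 0.8, 0.4, 1.0, 0.2]
RULES = [{1, 3}, {2, 5}]
answer, evaluated = topk_best_probability(PROBS, RULES, k=2)
print(sorted(answer), evaluated)
```

On the Table 1 data with $k = 2$, this sketch returns the tuples $t_1$, $t_2$ and $t_3$ while computing a top-2 probability for only two of the four lower-ranked tuples, since $t_4$ and $t_6$ are pruned by their probabilities alone.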

#### 3.4. The top-k best probability algorithm

We now propose the algorithm for selecting the answers to the top-k best probability queries on probabilistic data. Algorithm 1 finds the top-k best probability answer $Q^k_{top} = Q_{score} \cup Q_{pro}$.

```text
Input: probabilistic data D in ranking order, generation rules R.
Output: Q^k_top, the answer set to the top-k best probability query.
 1 foreach tuple t_i (i = 1 to n) do
 2   if i <= k then
 3     pr^k_top(t_i) <- p(t_i)   {special case in Theorem 3};
 4     Q^k_top <- Q^k_top ∪ (t_i, pr^k_top(t_i));
 5     bestpr <- min(pr^k_top(t_i));
 6     p_prev <- bestpr;
 7   else
 8     if (p(t_i) > bestpr) {Theorem 4} and (p(t_i) > p_prev) {Theorem 5} then
 9       pr^k_top(t_i) <- calculate top-k probability with generation rules R (Section 2);
10       if pr^k_top(t_i) > bestpr then
11         bestpr <- pr^k_top(t_i);
12         Q^k_top <- Q^k_top ∪ (t_i, pr^k_top(t_i));
13         p_prev <- p(t_i);
14     if bestpr satisfies Theorem 6 then
15       Exit;
```

Algorithm 1: The top-k best probability algorithm

- The set $Q_{score}$, which contains the tuples with the top-k highest scores, is found in lines 2 to 6. Because the probabilistic data set is in ranking order on score, the first k tuples $\{t_1, \dots, t_k\}$ have the best-k scores and are added to the answer set $Q_{score}$ (line 4), and their top-k probabilities are equal to their probabilities, as proven in Theorem 3 (line 3). Then, bestpr is initialised with the minimum top-k probability from $t_1$ to $t_k$ (line 5), and the first value of $p_{prev}$ is assigned the probability of the tuple having the minimum $pr^k_{top}$, which equals bestpr (line 6).
- The set $Q_{pro}$ containing non-dominated tuples is selected in lines 8 to 13. A current tuple $t_i$ is determined to be dominated or not as follows. Firstly, the top-k probability must be calculated with generation rules $R$ (line 9). These complex computation formulas are formally presented in Section 2. After this, the top-k probability of the current tuple is compared with bestpr.
  If the top-k probability is greater than bestpr, the current tuple is a non-dominated tuple (line 10). Then, it is added to the answer set $Q^k_{top}$ (line 12). The new values of bestpr and $p_{prev}$ are next set in lines 11 and 13.
- For effectiveness, the pruning rules and the stopping rule are applied at lines 8 and 14, respectively. The first condition ($p(t_i) > bestpr$) implements Theorem 4, and the second condition ($p(t_i) > p_{prev}$) implements Theorem 5 (line 8). Theorems 4 and 5 prune the current tuple directly, without the need to calculate its top-k probability: when either condition fails, the algorithm skips the calculation and moves on to the next tuple. In addition, the stopping condition of the top-k best probability process is activated as outlined in Theorem 6 (line 14), which means that there are no more non-dominated tuples among the remaining tuples.
- Overall, the top-k best probability algorithm returns the answer $Q^k_{top} = Q_{score} \cup Q_{pro}$.

### 4. Semantics of top-k best probability queries and other top-k queries

This section first discusses the semantic ranking properties proposed in [3, 4] and proposes the new property k-best ranking scores, based on users' expectations. Secondly, we examine which properties our approach and previous approaches possess.

#### 4.1. Semantics of ranking properties

On probabilistic data, ranking queries have been studied in relation to their semantics. As the semantics of possible worlds is based on the relationship between probability and score, the answers to the ranking query have to be discussed against static and dynamic postulates. The previous work [3, 4] formally defined the key properties of ranking query semantics on probabilistic data, based on users' natural expectations. Five properties are presented and serve as benchmarks to evaluate and categorize the different semantics of ranking queries. Let the set $Q^k_{top}(D)$ be the answer to the top-k query on probabilistic data $D$.

The first property, Exact-k, reflects the user's natural expectation that the answer set has k tuples.

Exact-k: If $|D| \ge k$ then $|Q^k_{top}(D)| = k$. When $|Q^k_{top}(D)| \ge k$, it is weakly satisfied.

The second property, Containment, captures users' intuition about the results returned.

Containment: For any k, the set $Q^k_{top}(D) \subset Q^{k+1}_{top}(D)$. When $\subset$ is replaced with $\subseteq$, it is weakly satisfied.

The third property states that the ranking answer should be unique.

Unique ranking: If probabilistic data $D = D'$, then the result $Q^k_{top}(D) = Q^k_{top}(D')$.

The fourth property concerns the limitation of score values on probabilistic databases.

Value invariance: Let $D$ have the score set ($t_1.score \ge t_2.score \ge \dots \ge t_n.score$), which is replaced by another score set ($t_1.score' \ge t_2.score' \ge \dots \ge t_n.score'$) to give $D_r$. For any k, the set $Q^k_{top}(D) = Q^k_{top}(D_r)$.

The fifth property expresses the dynamic postulate of ranking queries.

Stability: Given a tuple $t_i = (v_i, p(t_i)) \in D$, let $D'$ be a new probabilistic data set in which $t_i$ is replaced by $t'_i = (v'_i, p'(t_i))$, where $v'_i$ is better than $v_i$ and $p'(t_i) \ge p(t_i)$. If $t_i \in Q^k_{top}(D)$, then $t'_i \in Q^k_{top}(D')$.
However, users always expect the results of top-k queries to contain the k-best ranking score tuples, because the traditional top-k query always returns the set of k tuples which have the highest scores. Therefore, the extension of top-k queries to the probabilistic setting must return an answer set which includes the top-k tuples with the best ranking scores in order to satisfy the inheritance property. We propose the capability of providing these k-best scoring tuples as a property of top-k query semantics on probabilistic data. We name this sixth property k-best ranking scores.

k-best ranking scores: Let probabilistic data $D = (D_c, p)$, where $D_c$ is the database without the probability dimension, and $p$ is the set of probabilities. For any k, $Q^k_{top}(D_c) \subseteq Q^k_{top}(D)$.

In the next section, all six properties are used to analyse and compare the semantics of top-k queries on probabilistic databases under our approach and previous approaches.

#### 4.2. Top-k queries satisfying semantic properties

We now investigate whether the existing top-k methods and our approach satisfy the semantic properties.

| Method | k-best ranking score | Exact-k | Containment | Unique ranking | Value invariance | Stability |
|--------|----------------------|---------|-------------|----------------|------------------|-----------|
| U-top-k [14] | Fail | Fail | Fail | Satisfied | Satisfied | Satisfied |
| U-k-ranks [14, 31] | Fail | Satisfied | Satisfied | Fail | Satisfied | Fail |
| Global-top-k [4] | Fail | Satisfied | Fail | Satisfied | Satisfied | Satisfied |
| E-Score [3] | Fail | Satisfied | Satisfied | Satisfied | Fail | Satisfied |
| E-Rank [3] | Fail | Satisfied | Satisfied | Satisfied | Satisfied | Satisfied |
| PT-k [1, 2] | Weak | Fail | Weak | Satisfied | Satisfied | Satisfied |
| U-popk [16] | Fail | Satisfied | Satisfied | Satisfied | Satisfied | Satisfied |
| Top-k bestpr | Satisfied | Weak | Weak | Satisfied | Satisfied | Satisfied |

Table 4. Summary of top-k methods on the semantics of ranking query properties

Conclusions

In this paper, we proposed a novel top-k best probability query for probabilistic databases, which selects the tuples with the top-k best ranking scores and the non-dominated tuples for users. The proposed method also satisfies the semantic ranking properties, comparing favourably with other approaches in terms of ranking query semantics. Firstly, several concepts and theorems from previous studies were formally defined and discussed in relation to the probabilistic database model. Secondly, we created a new definition for the top-k best probability query based on the top-k best ranking score and the dominating concept. The algorithm was built with some proven pruning rules. Then, five semantic ranking properties and a new k-best ranking scores property were introduced to describe the semantic answer to top-k best probability queries. The proposed approach is an improvement on the previous state-of-the-art algorithms. The answers to the top-k best probability queries contain not only the top-k best score tuples but also the top-k best probability tuples. Finally, the experimental results verified the efficiency and effectiveness of our approach on both real and synthetic data sets. The proposed approach has been shown to outperform the algorithm designed for the PT-k query in both efficiency and effectiveness.

Uncertain data is inherent in many real-life applications and modern equipment. Therefore, discovering semantic answers to queries is a critical issue in relation to uncertain data. The proposed approach can be applied to modelling, managing, mining, and cleaning uncertain data.

References

[1] M. Hua, J. Pei, W. Zhang, X. Lin, Ranking queries on uncertain data: a probabilistic threshold approach, in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD '08, ACM, New York, NY, USA, 2008.
[2] M. Hua, J. Pei, X. Lin, Ranking queries on uncertain data, The VLDB Journal 20 (2011).
[3] G. Cormode, F. Li, K. Yi, Semantics of ranking queries for probabilistic data and expected ranks, in: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE '09, IEEE Computer Society, Washington, DC, USA, 2009.
[4] X. Zhang, J. Chomicki, Semantics and evaluation of top-k queries in probabilistic databases, Distrib. Parallel Databases 26 (1) (2009).
[5] R. Cheng, D. V. Kalashnikov, S. Prabhakar, Evaluating probabilistic queries over imprecise data, in: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, ACM, New York, NY, USA, 2003.
[6] I. F. Ilyas, G. Beskales, M. A. Soliman, A survey of top-k query processing techniques in relational database systems, ACM Comput. Surv. 40 (4) (2008) 11:1-11:58.
[7] A. D. Sarma, O. Benjelloun, A. Halevy, J. Widom, Working models for uncertain data, in: Proceedings of the 22nd International Conference on Data Engineering, ICDE '06, IEEE Computer Society, Washington, DC, USA.
[8] L. Getoor, Learning probabilistic relational models, Abstraction, Reformulation, and Approximation 1864 (2000).
[9] W. Zhang, X. Lin, J. Pei, Y. Zhang, Managing uncertain data: probabilistic approaches, in: Web-Age Information Management.
[10] C. Aggarwal, P. Yu, A survey of uncertain data algorithms and applications, Knowledge and Data Engineering, IEEE Transactions on 21 (5) (2009).
[11] T. Pang-Ning, S. Michael, K. Vipin, Introduction to data mining, Library of Congress.
[12] S. McClean, B. Scotney, P. Morrow, K. Greer, Knowledge discovery by probabilistic clustering of distributed databases, Data Knowl. Eng. 54 (2) (2005).
[13] R. Ross, V. S. Subrahmanian, J. Grant, Aggregate operators in probabilistic databases, J. ACM 52 (1) (2005).
[14] M. A. Soliman, I. F. Ilyas, K. C.-C. Chang, Top-k query processing in uncertain databases, in: ICDE, 2007.
[15] S. Abiteboul, P. Kanellakis, G. Grahne, On the representation and querying of sets of possible worlds, SIGMOD Rec. 16 (3) (1987).
[16] D. Yan, W. Ng, Robust ranking of uncertain data, in: Proceedings of the 16th international conference on Database systems for advanced applications - Volume Part I, DASFAA '11, Springer-Verlag, Berlin, Heidelberg, 2011.
[17] T. M. N. Le, J. Cao, Top-k best probability queries on probabilistic data, in: Proceedings of the 17th international conference on Database systems for advanced applications - Volume Part II, DASFAA '12, Springer-Verlag, Berlin, Heidelberg, 2012.
[18] T. Ge, S. Zdonik, S. Madden, Top-k queries on uncertain data: on score distribution and typical answers, in: Proceedings of the 35th SIGMOD international conference on Management of data, SIGMOD '09, ACM, New York, NY, USA, 2009.
[19] J. Li, B. Saha, A. Deshpande, A unified approach to ranking in probabilistic databases, The VLDB Journal 20 (2) (2011).
[20] J. Pei, M. Hua, Y. Tao, X. Lin, Query answering techniques on uncertain and probabilistic data: tutorial summary, in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD '08, ACM, New York, NY, USA, 2008.
[21] S. Zhang, C. Zhang, A probabilistic data model and its semantics, Journal of Research and Practice in Information Technology 35 (2003).
[22] M. J. Atallah, Y. Qi, Computing all skyline probabilities for uncertain data, in: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '09, ACM, New York, NY, USA, 2009.
[23] B. Qin, Y. Xia, Generating efficient safe query plans for probabilistic databases, Data Knowl. Eng. 67 (3) (2008).
[24] K. Lange, Numerical analysis for statisticians, Springer.
[25] D. Papadias, Y. Tao, G. Fu, B. Seeger, An optimal and progressive algorithm for skyline queries, in: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, ACM, New York, NY, USA, 2003.
[26] X. Lian, L. Chen, Shooting top-k stars in uncertain databases, The VLDB Journal 20 (6) (2011).
