# Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters


## Transcription

where m is the number of bits in the BF and n is the number of distinct elements seen so far [25].

### 3. STABLE BLOOM FILTERS

The Bloom filter is shown to be useful for representing the presence of a set of elements and answering membership queries, provided that a proper amount of space is allocated according to the number of distinct elements in the set.

#### 3.1 The Challenge to Bloom Filters

However, in many data stream applications, the allocated space is rather small compared to the size of the stream. When more and more elements arrive, the fraction of zeros in the Bloom filter will decrease continuously, and the false positive rate will increase accordingly, finally reaching the limit, 1, where every distinct element will be reported as a duplicate, indicating that the Bloom filter is useless.

Our general solution is to avoid the state where the Bloom filter is full by evicting some information from it before the error rate reaches a predefined threshold. This is similar to the replacement operation in the buffering method, in which there are several possible policies for choosing a past element to drop. In many real world data stream applications, the recent data is often more important than the older data [3, 26]. However, for the regular Bloom filter, there is no way to distinguish the recent elements from the past ones, since no time information is kept. Accordingly, we add a random deletion operation into the Bloom filter so that it does not exceed its capacity in a data stream scenario.

#### 3.2 Our Approach

To solve this problem, we introduce the Stable Bloom Filter, an extension of the regular Bloom filter.

Definition 1 (Stable Bloom Filter (SBF)). An SBF is defined as an array of integers SBF[1], ..., SBF[m] whose minimum value is 0 and maximum value is Max. The update process follows Algorithm 1. Each element of the array is allocated d bits; the relation between Max and d is then $Max = 2^d - 1$.
Compared to bits in a regular Bloom filter, each element of the SBF is called a cell. Concretely speaking, we change bits in the regular Bloom filter into cells, each consisting of one or more bits. The initial value of the cells is still zero. Each newly arrived element in the stream is mapped to K cells by some uniform and independent hash functions. As in a regular Bloom filter, we can check if a new element is a duplicate or not by probing whether all the cells the element is hashed to are non-zero. This is the duplicate detection process.

After that, we need to update the SBF. We first randomly decrement P cells by 1 so as to make room for fresh elements; we then set the same K cells as in the detection process to Max. Our symbol list is shown in Table 1, and the detailed algorithm is described in Algorithm 1.

Algorithm 1: Approximately Detect Duplicates using SBF

Data: A sequence of numbers $S = x_1, \ldots, x_i, \ldots, x_N$.

Result: A sequence of Yes/No corresponding to each input number.

1. Initialize SBF[1] ... SBF[m] = 0.
2. For each $x_i \in S$:
   1. Probe the K cells $SBF[h_1(x_i)], \ldots, SBF[h_K(x_i)]$.
   2. If none of the above K cells is 0, then DuplicateFlag = Yes; else DuplicateFlag = No.
   3. Select P different cells uniformly at random, $SBF[j_1], \ldots, SBF[j_P]$, with $P \in \{1, \ldots, m\}$.
   4. For each cell $SBF[j] \in \{SBF[j_1], \ldots, SBF[j_P]\}$: if $SBF[j] \geq 1$ then $SBF[j] = SBF[j] - 1$.
   5. For each cell $SBF[h(x_i)] \in \{SBF[h_1(x_i)], \ldots, SBF[h_K(x_i)]\}$: $SBF[h(x_i)] = Max$.
   6. Output DuplicateFlag.

Table 1: The Symbol List

| Symbols | Meanings |
| --- | --- |
| N | Number of elements in the input stream |
| M | Total space available in bits |
| m | Number of cells in the SBF |
| Max | The value a cell is set to |
| d | Number of bits allocated per cell |
| K | Number of hash functions |
| k | The probability that a cell is set in each iteration |
| P | Number of cells we pick to decrement by 1 in each iteration |
| p | The probability that a cell is picked to be decremented by 1 in each iteration |
| $h_i$ | The ith hash function |

#### 3.3 The Stable Property

Based on the algorithm, we find an important property of SBF both in theory and in experiments: after a number of iterations, the fraction of zeros in the SBF will become fixed no matter what parameters we set at the beginning. We call this the stable property of SBF and deem it important to our problem because the false positive rate is dependent on the fraction of zeros in SBF.

Theorem 1. Given an SBF with m cells, if in each iteration a cell is decremented by 1 with a probability p and set to Max with a probability k, the probability that the cell becomes zero after N iterations is a constant, provided that N is large enough, i.e. $\lim_{N \to \infty} Pr(SBF_N = 0)$ exists, where $SBF_N$ is the value of the cell at the end of iteration N.

In our formal discussion, we assume that the underlying distribution of the input data does not change over time.
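As an illustration, Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the class name `StableBloomFilter`, the method `seen`, the use of salted BLAKE2b digests as stand-ins for the hash functions $h_1, \ldots, h_K$, and the seeded random generator are all assumptions made for the example.

```python
import hashlib
import random


class StableBloomFilter:
    """Minimal sketch of Algorithm 1 (hashing scheme and seeding are assumptions)."""

    def __init__(self, m, K, P, max_value, seed=0):
        if not (1 <= P <= m and 1 <= K <= m):
            raise ValueError("need 1 <= P <= m and 1 <= K <= m")
        self.m, self.K, self.P, self.max_value = m, K, P, max_value
        self.cells = [0] * m              # all cells start at 0
        self.rng = random.Random(seed)    # seeded only to make runs repeatable

    def _indexes(self, item):
        # Hypothetical stand-in for h_1..h_K: one salted BLAKE2b digest per hash.
        for i in range(self.K):
            digest = hashlib.blake2b(repr(item).encode(), digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.m

    def seen(self, item):
        """Process one stream element; return True if it is reported as a duplicate."""
        idx = list(self._indexes(item))
        duplicate = all(self.cells[j] != 0 for j in idx)    # probe the K cells
        for j in self.rng.sample(range(self.m), self.P):    # P different random cells
            if self.cells[j] >= 1:
                self.cells[j] -= 1                          # decrement by 1
        for j in idx:
            self.cells[j] = self.max_value                  # set the K cells to Max
        return duplicate


sbf = StableBloomFilter(m=1000, K=2, P=4, max_value=1)
print([sbf.seen(x) for x in ["a", "b", "a"]])
```

Probing happens before the decrement and set steps, as in Algorithm 1, so an element repeated immediately is always reported as a duplicate, while one repeated after a long gap may have had one of its cells decremented to 0 in the meantime.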

Our experiments on real world data show that the assumption of an unchanging input distribution is not a very strong one, and the experimental results verify our theory.

Proof. Within each iteration, there are three operations: detecting duplicates, decreasing cell values and setting cells to Max. Since the first operation does not change the values of the cells, we just focus on the other two operations. Within the process of iterations from 1 to N, the cell could be set 0, ..., (N - 1) or even N times. Since the newest setting operation clears the impact of any previous operations, we can just focus on the process after the newest setting operation.

Let $A_l$ denote the event that within the N iterations the most recent setting operation applied to the cell occurs at iteration $N - l$, which means that no setting happened within the most recent l iterations (i.e. from iteration $N - l + 1$ to iteration N, $l < N$), and let $\bar{A}_N$ denote the event that the cell has never been set within the whole N iterations. Hence, the probability that the cell is zero after N iterations is as follows:

$$Pr(SBF_N = 0) = \sum_{l=Max}^{N-1} \left[ Pr(SBF_N = 0 \mid A_l)\, Pr(A_l) \right] + Pr(SBF_N = 0 \mid \bar{A}_N)\, Pr(\bar{A}_N), \tag{1}$$

where

$$Pr(SBF_N = 0 \mid A_l) = \sum_{j=Max}^{l} \binom{l}{j} p^j (1-p)^{l-j} \tag{2}$$

$$Pr(A_l) = (1-k)^l k \tag{3}$$

$$Pr(SBF_N = 0 \mid \bar{A}_N) = 1 \tag{4}$$

$$Pr(\bar{A}_N) = (1-k)^N. \tag{5}$$

We have Eq. 2 because during those l iterations there is no setting operation, and the cell becomes zero if and only if it is decremented no less than Max times. Clearly when $l < Max$ the cell cannot be decreased to 0, and $l = N$ means $\bar{A}_N$ happens, so we only consider the cases $Max \leq l \leq (N - 1)$ when applying Eq. 2 in Eq. 1. When $\bar{A}_N$ happens, the cell is 0 with probability 1 because the initial value of the cell is 0 and it has never been set; therefore we have Eq. 4.

Having the above equations, we can prove that $\lim_{N \to \infty} Pr(SBF_N = 0)$ exists. Due to the page limit, we put the detailed proof in an extended version of this paper.

Having Theorem 1, we can now prove our stable property statement.

Corollary 1 (Stable Property).
The expected fraction of zeros in an SBF after N iterations is a constant, provided that N is large enough.

Proof. In each iteration, each cell of the SBF has a certain probability of being set to Max by the element hashed to that cell. Since the underlying distribution of the input data does not change, the probability that a particular element appears in each iteration is fixed. Therefore, the probability of each cell being set is fixed. Meanwhile, the probability that an arbitrary cell is decremented by 1 is also a constant. According to Theorem 1, the probabilities of all cells in the SBF becoming 0 after N iterations are constants, provided that N is large enough. Therefore, the expected fraction of 0s in an SBF after N iterations is a constant, provided that N is large enough.

Now we know an SBF converges. In fact this convergence is like the process in which a buffer is continually filled with items. That an SBF is stable means that its maximum capacity is reached, similar to the case in which a buffer is full of items. Another important property is the convergence rate.

Corollary 2 (Convergence Rate). The expected fraction of 0s in the SBF converges at an exponential rate.

Proof. From Eq. 1, 4 and 5, we can derive

$$Pr(SBF_N = 0) - Pr(SBF_{N-1} = 0) = Pr(SBF_N = 0 \mid A_{N-1})\, Pr(A_{N-1}) + Pr(\bar{A}_N) - Pr(\bar{A}_{N-1}) = k(1-k)^{N-1} \left( Pr(SBF_N = 0 \mid A_{N-1}) - 1 \right). \tag{6}$$

Clearly, Eq. 6 exponentially converges to 0, i.e. $Pr(SBF_N[c] = 0)$ converges at an exponential rate, and this is true for all cells in the SBF. Therefore, the expected fraction of 0s in the SBF converges at an exponential rate.

Lemma 1 (Monotonicity). The expected fraction of 0s in an SBF is monotonically non-increasing.

Proof. Since the value of Eq. 6 is always no greater than 0, the probability that a cell becomes zero is always decreasing or remains the same. Similar to the proof of Corollary 2, we can draw the conclusion.

This lemma will be used to prove our general upper bound of the false positive rate, where the number of iterations need not be infinite.

#### 3.4 The Stable Point

Currently we know the fraction of 0s in an SBF will be a constant at some point, but we do not know the value of this constant. We call this constant the stable point.

Definition 2 (Stable Point). The stable point is defined as the limit of the expected fraction of 0s in an SBF when the number of iterations goes to infinity. When this limit is reached, we call the SBF stable.

From Eq. 1, we are unable to obtain the limit directly. However, we can derive it indirectly.
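Although the limit is not obvious from Eq. 1, it can be approximated numerically. The sketch below iterates the per-cell process assumed in Theorem 1 (set to Max with probability k, otherwise decremented with probability p) and tracks $Pr(SBF_N = 0)$ exactly. It then compares the last value with the closed form stated in Theorem 2 and checks the one-step difference against Eq. 6. The function names and the values of p, k and Max are illustrative assumptions, not settings from the paper.

```python
from math import comb


def zero_probabilities(p, k, max_value, steps):
    """Return [Pr(SBF_N = 0) for N = 0..steps] for a single cell that starts at 0."""
    dist = [1.0] + [0.0] * max_value         # dist[v] = Pr(cell == v)
    history = [dist[0]]
    for _ in range(steps):
        new = [0.0] * (max_value + 1)
        for v, prob in enumerate(dist):
            new[max_value] += k * prob       # set to Max (overrides any decrement)
            not_set = (1.0 - k) * prob
            new[max(v - 1, 0)] += not_set * p        # decremented; 0 stays 0
            new[v] += not_set * (1.0 - p)            # untouched
        dist = new
        history.append(dist[0])
    return history


def decrement_tail(l, p, max_value):
    """Eq. 2: Pr(at least max_value decrements within l iterations)."""
    if l < max_value:
        return 0.0
    below = sum(comb(l, j) * p**j * (1.0 - p)**(l - j) for j in range(max_value))
    return 1.0 - below


p, k, max_value = 0.004, 0.002, 3            # illustrative values only
zeros = zero_probabilities(p, k, max_value, steps=5000)
closed_form = (1.0 / (1.0 + 1.0 / (p * (1.0 / k - 1.0)))) ** max_value
print("iterated:", zeros[-1], "closed form:", closed_form)

# Largest deviation between the observed one-step change and Eq. 6.
worst = max(
    abs((zeros[n] - zeros[n - 1])
        - k * (1.0 - k) ** (n - 1) * (decrement_tail(n - 1, p, max_value) - 1.0))
    for n in range(1, 2001)
)
print("max deviation from Eq. 6:", worst)
```

Tracking the full distribution over cell values 0..Max avoids evaluating the sum in Eq. 1 for every N, at the cost of O(Max) work per iteration.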

Theorem 2. Given an SBF with m cells, if a cell is decremented by 1 with a constant probability p and set to Max with a constant probability k in each iteration, and if the probability that the cell becomes 0 at the end of iteration N is denoted by $Pr(SBF_N = 0)$, then

$$\lim_{N \to \infty} Pr(SBF_N = 0) = \left( \frac{1}{1 + \frac{1}{p(1/k - 1)}} \right)^{Max}. \tag{7}$$

Proof. The basic idea is to make use of the fact that when the SBF is stable, the expected fractions of 0, 1, ..., Max in the SBF should all be constant. See the full paper for the details. The theorem can be verified by replacing the parameters in Eq. 1 with some testing values.

From Theorem 2 we know the probability that a cell becomes 0 when the SBF is stable. If all cells have the same probability of being set, we can obtain the stable point easily. However, that requires the data stream to be uniformly distributed. Without this uniform distribution assumption, we have the following statement.

Theorem 3 (SBF Stable Point). When an SBF is stable, the expected fraction of 0s in the SBF is no less than

$$\left( \frac{1}{1 + \frac{1}{P(1/K - 1/m)}} \right)^{Max},$$

where K is the number of cells being set to Max and P is the number of cells decremented by 1 within each iteration.

Proof. The basic idea is to prove the case of m = 2 first, and generalize it to m > 2. See the full paper for the details.

#### 3.5 False Positive Rates

In our method, there could be two kinds of errors: false positives (FP) and false negatives (FN). A false positive happens when a distinct element is wrongly reported as a duplicate; a false negative happens when a duplicate element is wrongly reported as distinct. We call their probabilities false positive rates and false negative rates.

Corollary 3 (FP Bound when Stable). When an SBF is stable, the FP rate is a constant no greater than FPS,

$$FPS = \left( 1 - \left( \frac{1}{1 + \frac{1}{P(1/K - 1/m)}} \right)^{Max} \right)^{K}. \tag{8}$$

Proof. If $Pr_j(0)$ denotes the probability that the cell SBF[j] = 0 when the SBF is stable, the FP rate is

$$\left( \frac{1}{m}(1 - Pr_1(0)) + \cdots + \frac{1}{m}(1 - Pr_m(0)) \right)^{K} = \left( 1 - \frac{1}{m}\left( Pr_1(0) + \cdots + Pr_m(0) \right) \right)^{K}.$$

Please note that $\frac{1}{m}(Pr_1(0) + \cdots + Pr_m(0))$ is the expected fraction of 0s in the SBF. According to Theorems 1 and 3, the FP rate is a constant and Eq. 8 is an upper bound of the FP rate. This upper bound can be reached when the stream elements are uniformly distributed.

Corollary 4 (The case of reaching the FP Bound). Given an SBF with m cells, if the stream elements are uniformly distributed when the SBF is stable, the FP rate is FPS (Eq. 8).

Proof. Because elements in the input data stream are uniformly distributed, each cell in the SBF will have the same probability of being set to Max. According to Theorem 1 and the proof of Theorem 3 we can derive this statement.

Corollary 5 (General FP Bound). Given an SBF with m cells, FPS (Eq. 8) is an upper bound for FP rates at all time points, i.e. before and after the SBF becomes stable.

Proof. This can be easily derived from Lemma 1 and Corollary 3.

Therefore, the upper bound for FP rates is valid whether or not the SBF is stable. From Eq. 8 we can see that m has little impact on FPS, since 1/m is negligible compared to 1/K ($m \gg K$). This means the amount of space has little impact on the FP bound once the other parameters are fixed. The value of P has a direct impact on FPS: the larger the value of P, the smaller the value of FPS. This can be seen intuitively: the faster the cells are cleared, the more 0s the SBF has, and thus the smaller the value of FPS is. Conversely, increasing the value of Max results in an increase of FPS. In contrast to P and Max, from the formula we can see that the impact of the value of K on FPS is twofold: intuitively, using more hash functions increases the distinguishing power for duplicates (decreases FPS), but fills the SBF faster (increases FPS).
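The trends described above can be checked by evaluating Eq. 8 directly. The following is a small sketch under illustrative parameter values; the function names `stable_zero_fraction` and `fps_bound` are hypothetical helpers introduced for this example.

```python
def stable_zero_fraction(P, K, m, max_value):
    """Lower bound on the expected fraction of 0s when the SBF is stable (Theorem 3)."""
    return (1.0 / (1.0 + 1.0 / (P * (1.0 / K - 1.0 / m)))) ** max_value


def fps_bound(P, K, m, max_value):
    """Upper bound FPS on the false positive rate (Eq. 8)."""
    return (1.0 - stable_zero_fraction(P, K, m, max_value)) ** K


# Illustrative settings only: vary one parameter at a time.
for m in (10**3, 10**5, 10**7):
    print("m =", m, "FPS =", fps_bound(P=4, K=2, m=m, max_value=1))
for P in (2, 4, 8):
    print("P =", P, "FPS =", fps_bound(P=P, K=2, m=10**7, max_value=1))
for max_value in (1, 3, 7):
    print("Max =", max_value, "FPS =", fps_bound(P=4, K=2, m=10**7, max_value=max_value))
for K in (1, 2, 3, 4, 6):
    print("K =", K, "FPS =", fps_bound(P=4, K=K, m=10**7, max_value=1))
```

With K = 2, Max = 1 and P = 4 the bound comes out close to 1/9 for large m, and it barely moves as m changes by several orders of magnitude.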
#### 3.6 False Negative Rates

A false negative (FN) is an error in which a duplicate element is wrongly reported as distinct. It is generated only by duplicate elements, and is related to the input data distribution, especially the distribution of gaps. A gap is the number of elements between a duplicate and its nearest predecessor.

Suppose a duplicate element $x_i$ whose nearest predecessor is $x_{i-\delta_i}$ ($x_i = x_{i-\delta_i}$) is hashed to K cells, $SBF[C_{i1}], \ldots, SBF[C_{iK}]$. An FN happens if any of those K cells is decremented to 0 during the $\delta_i$ iterations before $x_i$ arrives. Let $P_{R0}(\delta_i, k_{ij})$ be the probability that cell $C_{ij}$ ($j = 1 \ldots K$) is decremented to 0 within the $\delta_i$ iterations. This probability can be computed as in Eq. 1:

$$P_{R0}(\delta_i, k_{ij}) = \sum_{l=Max}^{\delta_i - 1} \left[ Pr(SBF_{\delta_i} = 0 \mid A_l)\, Pr(A_l) \right] + Pr(SBF_{\delta_i} = 0 \mid \bar{A}_{\delta_i})\, Pr(\bar{A}_{\delta_i}), \tag{9}$$

where

$$Pr(SBF_{\delta_i} = 0 \mid A_l) = \sum_{j=Max}^{l} \binom{l}{j} p^j (1-p)^{l-j}, \tag{10}$$

$$Pr(A_l) = (1 - k_{ij})^l k_{ij}, \tag{11}$$

$$Pr(SBF_{\delta_i} = 0 \mid \bar{A}_{\delta_i}) = \sum_{j=Max}^{\delta_i} \binom{\delta_i}{j} p^j (1-p)^{\delta_i - j}, \tag{12}$$

$$Pr(\bar{A}_{\delta_i}) = (1 - k_{ij})^{\delta_i}, \tag{13}$$

and $k_{ij}$ is the probability that cell $C_{ij}$ is set to Max in each iteration. The meanings of the other symbols are the same as those in the proof of Theorem 1. Most of the above equations are also similar to those in that proof, except that Eq. 12 is different from Eq. 4. This is because the initial value of the cell in the case of Theorem 1 is 0, but it is Max here.

Furthermore, the probability that an FN occurs when $x_i$ arrives can be expressed as follows:

$$Pr(FN_i) = 1 - \prod_{j=1}^{K} \left( 1 - P_{R0}(\delta_i, k_{ij}) \right). \tag{14}$$

When $\delta_i < Max$, $P_{R0}(\delta_i, k_{ij})$ is 0, which means the FN rate is 0. Besides, for distinct elements, which have no predecessors, the FN rates are 0. The value of $\delta_i$ depends on the input data stream. In the next section, we discuss how to adjust the parameters to minimize the FN rate under the condition that the FP rate is bounded within a user-specified threshold.

### 4. FROM THEORY TO PRACTICE

In the previous section we proposed the SBF method and analytically studied some of its properties: stability, convergence rate, monotonicity, stable point, FP rates (upper bound) and FN rates. In this section, we discuss how SBF can be used in practice and how our analytical results can be applied.

#### 4.1 Parameters Setting

Since FP rates can be bounded regardless of the input data but FN rates cannot, given a fixed amount of space, we can choose a combination of Max, K and P that minimizes the number of FNs under the condition that the FP rate is within a user-specified threshold. Meanwhile we take into account the time spent on each element, which is crucial in many data stream applications.

Overview of parameters setting. We have 3 parameters to set in our algorithm: Max, K and P. They are related to other parameters: FP rates, FN rates and the SBF size m. Among these parameters, we assume that users specify m and the allowable FP rate.
Based on the analysis in the previous section, we can obtain a formula computing the value of P from other parameters provided that Max and K have been chosen already. To set Max and K properly, we first derive the relationship between FN rates (the expected number of FNs), other parameters and the input data distribution. We find that the optimal value of K which minimizes the FN rates is independent of the input data. Thus, K can be set by trying different possible values on the formula we derive without considering the input data distribution, assuming Max is known. Last, we show that Max can be set empirically.

The expected number of FNs. Since our goal is to minimize the number of FNs, we can compute the expected number of FNs, $E(\#FN)$, as the sum of FN rates for each duplicate element in the stream: $E(\#FN) = \sum_{i=1}^{\tilde{N}} Pr(FN_i)$, where $\tilde{N}$ is the number of duplicates in the stream. Combining it with Eq. 14 we have

$$E(\#FN) = \sum_{i=1}^{\tilde{N}} \left[ 1 - \prod_{j=1}^{K} \left( 1 - P_{R0}(\delta_i, k_{ij}) \right) \right], \tag{15}$$

where $\delta_i$ is the number of elements between $x_i$ and its predecessor, and $k_{ij}$ is the probability that cell $C_{ij}$ is set to Max in each iteration. $C_{ij}$ is the cell that element $x_i$ is hashed to by the jth hash function. Since the function $P_{R0}(\delta, k)$ is continuous, for each $x_i$ there must be a $\bar{k}_i$ such that

$$\left( 1 - P_{R0}(\delta_i, \bar{k}_i) \right)^K = \prod_{j=1}^{K} \left( 1 - P_{R0}(\delta_i, k_{ij}) \right).$$

For the same reason, there must be an average $\bar{\delta}$ and an average $\bar{k}$ such that

$$\tilde{N} \left[ 1 - \left( 1 - P_{R0}(\bar{\delta}, \bar{k}) \right)^K \right] = \sum_{i=1}^{\tilde{N}} \left[ 1 - \left( 1 - P_{R0}(\delta_i, \bar{k}_i) \right)^K \right] = E(\#FN).$$

Let $f(\bar{\delta}, \bar{k})$ be the average FN rate, i.e.

$$f(\bar{\delta}, \bar{k}) = 1 - \left( 1 - P_{R0}(\bar{\delta}, \bar{k}) \right)^K. \tag{16}$$

Our task then becomes setting the parameters to minimize this average FN rate, $f(\bar{\delta}, \bar{k})$, while bounding the FP rate within an acceptable threshold.

The setting of P. Suppose users specify a threshold FPS, indicating the acceptable FP rate. This threshold establishes a constraint between the parameters Max, K, P, m and FPS according to Corollary 5. Thus, users can set P based on the other parameters:

$$P = \frac{1}{\left( \frac{1}{\left( 1 - FPS^{1/K} \right)^{1/Max}} - 1 \right) \left( 1/K - 1/m \right)}. \tag{17}$$

Since m is usually much larger than K, 1/m is negligible in the above equation, which means that the setting of P is dominated only by FPS, Max and K, and is independent of the amount of space.

The setting of K. Since the FP constraint can be satisfied by properly choosing P, we can set K such that it minimizes the number of FNs. From the above discussions we know the relationship between the expected number of FNs and the probabilities that cells are set to Max. Next, we connect these probabilities with our parameters K, m and the input stream.

Suppose there are N elements in the stream of which n are distinct, and the frequency of each distinct element $x_l$ is $f_l$. Clearly $\sum_{l=1}^{n} f_l = N$. Assuming that the hash functions are uniformly at random, for a cell that element $x_i$ is hashed to, the number of times the cell is set to Max after seeing all N elements is a random variable, $f_i + \sum_{l=1}^{n-1} f_l I_l$, where $f_i$ is the frequency of $x_i$ in the stream, and each $I_l$ ($l = 1 \ldots n-1$) is an independent random variable following the Bernoulli distribution, i.e.

$$I_l = \begin{cases} 1, & Pr(I_l = 1) = \frac{K}{m}, \\ 0, & Pr(I_l = 0) = 1 - \frac{K}{m}. \end{cases}$$

Thus, $k_i = \frac{1}{N} f_i + \frac{1}{N} \sum_{l=1}^{n-1} f_l I_l$ is also a random variable. For the K cells an element $x_i$ is hashed to, the probabilities that those cells are set to Max in each iteration can be considered as K trials of $k_i$. Since the mean and the variance of each $I_l$ are $\mu_{I_l} = \frac{K}{m}$ and $\sigma^2_{I_l} = \frac{K}{m}\left(1 - \frac{K}{m}\right)$ respectively, it is not hard to derive the mean and variance of $k_i$:

$$\mu_{k_i} = \frac{1}{N} f_i + \frac{K}{m} \cdot \frac{1}{N} \sum_{l=1}^{n-1} f_l = \frac{1}{N} f_i + \frac{K}{m}\left(1 - \frac{1}{N} f_i\right) \quad \text{and} \quad \sigma^2_{k_i} = \frac{K}{m}\left(1 - \frac{K}{m}\right) \frac{1}{N^2} \sum_{l=1}^{n-1} f_l^2.$$

Let

$$\phi_i = \frac{k_i - \mu_{k_i}}{\sqrt{\frac{K}{m}\left(1 - \frac{K}{m}\right)}} = \frac{k_i - \frac{1}{N} f_i - \frac{K}{m}\left(1 - \frac{1}{N} f_i\right)}{\sqrt{\frac{K}{m}\left(1 - \frac{K}{m}\right)}} \tag{18}$$

be a transformation on $k_i$. Then

$$\phi_i \in \left[ -\frac{\frac{1}{N} f_i + \frac{K}{m}\left(1 - \frac{1}{N} f_i\right)}{\sqrt{\frac{K}{m}\left(1 - \frac{K}{m}\right)}},\ \frac{1 - \frac{1}{N} f_i - \frac{K}{m}\left(1 - \frac{1}{N} f_i\right)}{\sqrt{\frac{K}{m}\left(1 - \frac{K}{m}\right)}} \right]$$

is a random variable whose mean and variance are $\mu_{\phi_i} = 0$ and $\sigma^2_{\phi_i} = \frac{1}{N^2} \sum_{l=1}^{n-1} f_l^2$. Note that

$$\sigma^2_{\phi_i} \leq \frac{1}{N^2} \sum_{l=1}^{n-1} (f_l f_{max}) < \frac{1}{N^2} \sum_{l=1}^{n} (f_l f_{max}) = \frac{f_{max}}{N},$$

where $f_{max}$ is the frequency of the most frequent element in the stream. Since the mean and the variance of the random variable $\phi_i$ are independent of K and m, we may consider $\phi_i$ independent of K and m in practice. In other words, $\phi_i$ can be seen as a property of the input stream. Similar to $\bar{k}_i$ we can obtain a $\bar{\phi}_i$ such that

$$\left( 1 - P_{R0}(\delta_i, \bar{\phi}_i) \right)^K = \prod_{j=1}^{K} \left( 1 - P_{R0}(\delta_i, \phi_{ij}) \right) = 1 - Pr(FN_i), \tag{19}$$

where $\bar{\phi}_i \in [Min(\phi_{ij}), Max(\phi_{ij})]$, and $\phi_{ij}$ are K trials of $\phi_i$ ($j = 1 \ldots K$). Since the standard deviation of $\phi_i$ is very small compared to the range of its possible values, and $\phi_i$ is considered independent of K and m, $\bar{\phi}_i$ can be approximately considered independent of K and m as well. For example, in one setting with $m/K = 10^6$, the value range of $\phi_i$ is approximately [0, 1000], while $\sigma_{\phi_i} \approx 0.1$.

[Figure 1: FN rates vs. K. Three panels, each plotting the FN rate against K, with curves for different values of fxn, phi and delta respectively.]
To set K, keeping all other parameters fixed we vary the values of K and compute the FN rate based on Eq. 9, 19, 17 and 18. By trying different combinations of parameter settings (Max = 1, 3, 7, 15; FPS = 0.2, 0.1, 0.01, 0.001; m = 1000, $10^5$, $10^7$, $10^9$; $\delta_i$ = 10, 100, 1000, $10^5$, $10^7$, $10^9$; fxn = $f_i/N$ = 0.5, 0.1, 0.01, 0.0001; and $\phi_i$ = 0.001, 0.1, 1, 10, 100, 1000, ...), we find that once the values of FPS and Max are fixed, the value of the optimal or near optimal K is independent of the values of $\delta_i$, $f_i/N$, $\phi_i$ and m.

Observation 1. The value of the optimal or near optimal K is dominated by Max and FPS. The input data and the amount of space have little impact on it. Furthermore, the value is small (<10 in all of our testing).

For example, when FPS = 0.2 and Max = 1, the value of the optimal or near optimal K is always between 1 and 2; when FPS = 0.1 and Max = 3, it is always between 2 and 3; when FPS = 0.01 and Max = 3, it is always between 4 and 5. Therefore, without considering the input data stream we can pre-compute the FN rates for different values of K based on Max and FPS and choose the optimal one. Our experimental results reported in the next section are consistent with this observation.

Figure 1 shows an example of how the FN rates change with different values of K under different parameter settings based on Eq. 9 and Eq. 19. From the figure we can see that in the case of Max = 1 and FPS = 0.1, we can set K to 2 regardless of the input stream and the amount of space. Therefore, in practice we can set $\phi_i$, $\delta_i$ and $f_i/N$ to some testing values (e.g. $\phi_i = 0$ and $\delta_i = 200$) and find the optimal or near optimal K using the formulas.

The setting of Max. Based on the above discussion, we can set K regardless of the input data, but to choose a proper value of Max, we need to consider the input. More specifically, to minimize the expected number of FNs, we need to know the distribution of gaps in the stream in order to try different possible values of Max on Eq. 16 and 19.
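As a rough illustration of how such formulas can be evaluated, the sketch below computes $P_{R0}$ from Eq. 9 to 13 and the average FN rate of Eq. 16 for a single gap value. It assumes $\phi = 0$ and $f_i/N = 0$, so that the set probability is k = K/m, and it takes the decrement probability as p = P/m. The function names and the parameter values are hypothetical and chosen only for the example.

```python
from math import comb


def decrement_tail(l, p, max_value):
    """Pr(a cell is decremented at least max_value times in l iterations)."""
    if l < max_value:
        return 0.0
    below = sum(comb(l, j) * p**j * (1.0 - p)**(l - j) for j in range(max_value))
    return max(0.0, 1.0 - below)


def p_r0(delta, k, p, max_value):
    """Eq. 9: probability that a cell set to Max is 0 again delta iterations later."""
    total = 0.0
    for l in range(max_value, delta):       # most recent re-set was l iterations ago
        total += decrement_tail(l, p, max_value) * (1.0 - k) ** l * k
    # Never set again during the whole gap (Eq. 12 and 13).
    total += decrement_tail(delta, p, max_value) * (1.0 - k) ** delta
    return total


def average_fn_rate(delta, K, P, m, max_value):
    """Eq. 16, assuming phi = 0 and f_i/N = 0 (k = K/m) and p = P/m."""
    return 1.0 - (1.0 - p_r0(delta, K / m, P / m, max_value)) ** K


# Illustrative values only: a small filter and a few gap sizes.
for delta in (10, 1000, 20000):
    print(delta, average_fn_rate(delta, K=2, P=4, m=10**4, max_value=1))
```

The loop over l makes the cost linear in the gap, so very large gaps would call for a closed form or an incremental evaluation instead.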
Since the expected value of $\phi_i$ is 0 and its standard deviation is very small compared to its value domain, we set $\phi$ to 0 in the formulas. To use the space effectively we only set Max to $2^d - 1$ (d is the number of bits per cell); otherwise the bits allocated to each cell are just wasted. Furthermore, in terms of the time cost, Max should be set as small as possible, because the larger Max is, the larger P will be (see Eq. 17, assuming K is a constant). For example, when Max = 1, FPS = 0.01 and K = 3 (the optimal K), the computed value of P is 10; while when Max = 15, FPS = 0.01 and K = 6 (the optimal K), the computed value of P is 141 (the value of P is not sensitive to m). In practice, we limit our choice of Max to 1, 3 and 7 (if a higher time cost can be tolerated, larger values of Max can be tried similarly). To choose a Max from these candidates, we try each value on Eq. 16 and Eq. 19, and find the one minimizing the average FN rate.

Figure 2 depicts the difference of average FN rates between Max = 3 and Max = 1 based on Eq. 16 and Eq. 19. We set $f_i/N = 0$ because we are considering the entire stream rather than a particular element in this case. The figure shows that if the values of the gaps ($\bar{\delta}$) are smaller than a threshold, Max = 3 is a better choice. When the gaps become larger, Max = 1 is better. If the gaps are large enough, there is not much difference between the two options. The figure shows the cases under different settings of $\phi$, space sizes and acceptable FP rates.

[Figure 2: FN rates difference between Max = 3 and Max = 1 (Max3 - Max1) vs. gaps (delta). K is set to the optimal value respectively under different settings. The curves cover several combinations of space, FPS and phi.]

We also tested the FN rate difference between Max = 7 and Max = 3, and observed the same general rule: a larger value of Max is better for smaller gaps, and a smaller FPS suggests a larger setting of Max. Similarly, we find no exceptions under other combinations of settings: FPS = 0.2, 0.1, 0.01, 0.001 and m = 1000, $10^5$, $10^7$, $10^9$.

When trying different values of Max on Eq. 16 and Eq. 19, we set $\phi$ to 0 and assume that the distribution of the gaps is known. If this assumption cannot be satisfied in practice, we suggest setting Max to 1, because this setting often benefits a larger range of gaps in the stream. Our experiments also show that in most cases setting Max to 1 achieves better improvements in terms of error rates compared to the alternative method, LRU buffering. In fact, buffering performs well when gaps are small, which is similar to the cases in which Max is larger. The behavior of our SBF becomes closer to the buffering method when the value of Max is set larger.

Summary of parameters setting. In practice, given an FPS, the amount of available space and the gap distribution of the input data, to set the parameters properly, we first establish a constraint for P, which means P can be computed based on FPS, m, Max and K; then we find the optimal value of K for each case of Max (1, 3, 7) by trying a limited number (≤ 10) of values of K on the FN rate formulas; last, we estimate the expected number of FNs for each candidate value of Max using its corresponding optimal K and some prior knowledge of the stream, and thus choose the optimal value of Max. In the case that no prior knowledge of the input data is available, we suggest setting Max = 1. The described parameter setting process can be implemented within a few lines of code.

#### 4.2 Time Complexity

Since our goal is to minimize the error rates given a fixed amount of space and an acceptable FP rate, we do not discuss space complexity, and just focus on time complexity. There are several parameters to be set in our method: K, Max and P. Within each iteration, we first need to probe K cells to detect duplicates. After that we pick P cells and decrement them by 1. Last, we set the same K cells as probed in the first step to Max. Therefore, the time cost of our algorithm for handling each element is dominated by K and P.

Theorem 4 (Time Complexity). Given that K and Max are constants, processing each data stream element needs O(1) time, independent of the size of the space and the stream.

Proof. From Eq. 17 we know the constraint among K, P, m, Max and FPS (the user-specified upper bound of false positive rates). If K, Max and FPS are constants, P decreases slightly as m grows and approaches a constant for $m \gg K$, which means m has no practical impact on the processing time. Since Max, K and FPS are all constants, the time complexity is O(1).
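The weak dependence of P on m can be seen by evaluating Eq. 17 for several space sizes. This is a small sketch; `required_p` is a hypothetical helper that returns the real-valued P, and how to round it to an integer number of cells is left to the caller.

```python
def required_p(fps, K, m, max_value):
    """Real-valued P from Eq. 17 for a target FP bound fps."""
    # Fraction of 0s that the stable SBF must keep so that Eq. 8 equals fps.
    z = (1.0 - fps ** (1.0 / K)) ** (1.0 / max_value)
    return 1.0 / ((1.0 / z - 1.0) * (1.0 / K - 1.0 / m))


# Same FPS, K and Max, with the number of cells varied over six orders of magnitude.
for m in (10**3, 10**5, 10**7, 10**9):
    print("m =", m, "P =", required_p(0.1, K=2, m=m, max_value=1))

# Larger Max requires a larger P for the same FPS (here K is chosen per setting).
print(required_p(0.01, K=3, m=10**7, max_value=1))
print(required_p(0.01, K=6, m=10**7, max_value=15))
```

Because P does not grow with m, the per-element work of probing K cells, decrementing P cells and setting K cells stays bounded as the filter gets larger.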
Based on the discussion of parameter settings, we know that the selection of K is insensitive to m. Furthermore, the value of m and the stream size have little impact on the selection of Max based on our testing on Eq. 16. Therefore, our algorithm needs O(1) time per element, independent of the size of the space.

### 5. EXPERIMENTS

In this section, we first describe our data set and the implementation details of 4 methods: SBF, Bloom Filter (BF), Buffering and FPBuffering (a variation of buffering which can be fairly compared to SBF). We then report some of the results on real data sets. We also ran experiments on some synthetic data sets, but due to the page limit, we cannot show the results here. Last, we summarize the comparison between different methods.

#### 5.1 Data Sets

Real World Data. We simulated a web crawling scenario [8] as discussed in Section 1, using a Web crawl data set obtained from the Internet archive [2]. We hashed each URL in this collection to a 64-bit fingerprint using Rabin's method [27], as was done earlier [8]. With this fingerprinting technique, there is a very small chance that two different URLs are mapped to the same fingerprint. We verified the data set and did not find any collisions between the URLs. In the end, we obtained a 28GB data file that contained about 700 million fingerprints of links, representing a stream of URLs encountered in a Web crawling process.

#### 5.2 Implementation Issues

SBF Implementation. Our algorithm is simple and straightforward to implement: 1) hash each incoming stream element into K numbers, and check the corresponding K cells; 2) generate a random number, and decrement the corresponding cell and the (P - 1) cells adjacent to it by 1; this process is faster than generating P random numbers for each element, and although the processes of picking the P cells are not independent, each cell has a probability of P/m of being picked at each iteration, so our analysis still holds; 3) set those K cells checked in step 1 to Max.

One issue we have to deal with is setting the parameters Max, K and P. According to the discussions of parameters setting in the previous section, we can set Max, K and P for a given FPS without considering the input data sets. For example, for FPS = 10%, we set Max = 1, K = 2 and P = 4, which worked well for different data sets in our experiments. To evaluate our work, we implemented 3 alternative methods: Bloom Filters (BF), buffering and FPBuffering.

Bloom Filters Implementation. In our implementation, BF becomes a special case of SBF where Max = 1 and P = 0. Knowing the number of distinct elements and the amount of space, we can compute the optimal K (see the discussion in Section 2.3).

Buffering Implementation. Implementing buffering needs more work. First, to detect duplicates we need to search the buffer. To speed up the searching process, we used a hash table, as was done by Broder et al. [8]. Second, when the buffer is full, we have to choose a policy to evict an old element and make room for the newly arriving one. Broder et al. [8] compared 5 replacement policies for caching Web crawls. They showed that LRU and Clock, the latter of which is used as an approximation of LRU, were the best practical choices for the URL data set (there were some ideal but impractical ones as well); in terms of miss rate (FN rate in our case), there was almost no difference between these two. We chose LRU in our experiments. Both LRU and Clock need a separate data structure for buffering elements, so that we can choose one for eviction [8]. For simplicity of the implementation, we used a doubly linked list, while Broder et al. chose a heap. This difference should not affect our experimental results since our error rate comparison did not account for the extra space we used in buffering.

FPbuffering Implementation. To fairly and effectively compare our method to the buffering method, we introduced a variation of buffering called FPbuffering. There are two reasons for this. First, SBF has both FPs and FNs while buffering has only FNs. In different applications the importance of FPs and FNs may be different, so it is hard to compare SBF to buffering directly. Second, the fraction of duplicates in the data stream is a dominant factor affecting the error rates, because FNs are only generated by duplicates and FPs by distincts. For buffering, a data stream full of duplicates will cause many FNs, while a stream consisting of all distincts causes no errors at all.

FPbuffering works as follows: when a new data stream element arrives, we search for it in the buffer. If it is found, we report a duplicate as in the original buffering; if it is not found, we report it as a duplicate with a probability q, and as a distinct with probability (1 - q). In the original buffering, if an element is not found in the buffer, it is always reported as a distinct. This variation can increase the overall error rates of buffering when there are more distincts in the stream, but can decrease the error rates when there are more duplicates in the stream. Clearly, FPbuffering has both FPs and FNs.
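The FPbuffering idea can be sketched on top of a simple LRU buffer. This is a minimal illustration rather than the implementation used in the experiments: the class name `FPBuffer`, the use of `OrderedDict` in place of a hash table plus doubly linked list, the seeded random generator, and the choice to store every missed element regardless of what is reported are assumptions made for the example.

```python
import random
from collections import OrderedDict


class FPBuffer:
    """Sketch of FPbuffering on top of an LRU buffer (details are assumptions)."""

    def __init__(self, capacity, q, seed=0):
        self.capacity = capacity
        self.q = q                          # probability of reporting a miss as a duplicate
        self.buffer = OrderedDict()         # keys ordered from least to most recently used
        self.rng = random.Random(seed)      # seeded only to make runs repeatable

    def seen(self, item):
        """Process one element; return True if it is reported as a duplicate."""
        if item in self.buffer:
            self.buffer.move_to_end(item)   # refresh recency on a hit
            return True
        # Assumption: a missed element is always stored, whatever gets reported.
        self.buffer[item] = None
        if len(self.buffer) > self.capacity:
            self.buffer.popitem(last=False)  # evict the least recently used element
        return self.rng.random() < self.q


buf = FPBuffer(capacity=2, q=0.1)
print([buf.seen(x) for x in ["a", "b", "a", "c", "b"]])
```

With q = 0 the class behaves like plain LRU buffering, which only produces FNs; raising q trades some FPs on distinct elements for fewer FNs on duplicates that have already been evicted.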
In fact, q is the FP rate, since a distinct element will be reported as a duplicate with probability q. By setting a common FP rate with SBF, we can fairly compare their FN rates, and this comparison will not be affected by the fraction of duplicates in the stream.

In our experiments, we assumed that buffering and FPbuffering required 64 bits per URL fingerprint on the Web data (same as [8]) and 32 bits per element on the synthetic data, simulating the size of an IP address. In other words, each element occupies 64 bits for the real data and 32 bits for the synthetic data.

5.3 Theory Verification

In an experiment to verify some of our theoretical results, we tested the stable property of our SBF and its convergence rate. The results are shown in Figure 3. From the graph we can see that the fraction of zeros in the SBF decreases until it becomes stable. When the allocated space is small, the convergence rate is higher. This is because when the space is larger, the probability of a cell being set is smaller; from Corollary 2 and Eq. 6 we know that the convergence rate should be lower in this case. Also, we can see that when the SBF is stable, the fraction of zeros still fluctuates slightly. This can be caused by an input data stream whose underlying distribution varies. Furthermore, the fraction of 0s keeps decreasing in general before becoming stable; at this point, the FP rate should reach its maximum, and our theoretical upper bound for FP rates is also valid before the SBF becomes stable in this case. Our next experiments show the effectiveness of our theoretical FP bound. When the space is relatively small, the real FP rate is close to the bound.

5.4 Error Rates Comparison

This experiment compared the error rates of SBF, FPbuffering, buffering and BF on the real data by varying the size of the space. The real data set contained 694,984,445 URL fingerprints, of which 4.75% were distinct.
To do the comparison under different fractions of distinct elements, we built two more real data sets by using the first 00,000 and 0 million elements of the original data file. The fractions of distinct elements for these two data sets were 75.66% and 48.5% respectively. For SBF, we set the acceptable FP rate (number of FPs / number of distincts), FPS, to 0%, and set Max=, K=2 and P=4. The results under different FPS settings will be shown in the next experiment. For FPbuffering, we set the FP rate to the same value as for SBF, so that both generated exactly the same number of FPs and we can compare just their FN rates. Please note that buffering and BF only generate FNs and FPs respectively, and FPbuffering reduces the FN rates of buffering substantially in most cases by introducing a certain amount of FPs.

Figure 3: Fraction of zeros changed with time on the whole real data set (Max=, K=2, P=4, FPS=0%), space unit = 64 bits

Figure 4: Error rates comparison between SBF, FPbuffering, buffering and BF

Comparison between different methods. The tables in Figure 4 show that when the space is relatively small, SBF is better. SBF beats FPbuffering by 3-3% in terms of FN rate on different data sets, when their FP rates are the same. For the problem we are studying, we think this amount of improvement is nontrivial for 2 reasons. First, Broder et al. [8] implemented a theoretically optimal buffering algorithm called MIN, which assumes the entire sequence of requests is known in advance and accordingly chooses the best replacement strategy. Even this obviously impractical and ideal algorithm can only reduce the miss rates (FN rates in our case) of LRU buffering by no more than 5% in about 2/3 of the region (different buffer sizes). Second, from the tables we can see that even when the amount of space is increased by a factor of 4, the FN rates for buffering can be decreased by around 0-20%, which means the improvement from SBF may be equivalent to that from doubling the amount of space. The FP rates of BF are much higher than the acceptable FP rates in the first 2-3 rows of each table. Since buffering only generates FNs, it is not comparable to SBF here. But we can see that the FN rates of FPbuffering also decrease by introducing FPs into it.

However, we also notice that when the space is relatively large (the last row of each table), SBF does not perform as well as buffering and BF. This is because when the space is large, BF might be able to hold all the distincts and keep a reasonable FP rate. We can directly compute the amount of space required from the desired FP rate and the number of distincts in the data set, according to the formula in Section 2.3. In this case, there is no need to evict information from the BF, which means SBF is not applicable. If we can afford even more space, enough to hold all the distincts in the data set in a buffer, there will be no errors at all. The last row of the second table shows this scenario. But in many data stream applications, a fast storage is needed to satisfy real-time constraints, and the size of this storage is typically less than the universe of the stream elements, as discussed in the Introduction. Another fact is that with both SBF and buffering, we can refresh the storage and bias it towards recent data; both continuously evict stale information and keep fresh elements. BF is not applicable in this case, since it can only be used to represent a static data set. Thus, it is not useful in many data stream scenarios that require dynamic updates.

Varying acceptable FP rates. Another experiment we ran tested the effect of changing the acceptable FP rates. The results are shown in Figure 5. In this experiment, we set Max=3, K=4 when the acceptable FP rates are set to 0.5% and %, and set Max=, K=2 when the acceptable FP rates are set to 0% and 20%. The bar chart depicts the FN rate difference between FPbuffering and SBF. Again, the FP rates of both methods are set to the same number. Clearly, it shows that the more FPs are allowed, the better SBF performs.

Figure 5: FN rate differences between FPbuffering and SBF, varying the allowable FP rate (695m elements)

5.5 Time Comparison

As discussed in the implementation section, SBF and BF need O(1) time to process each element. The exact time depends on the parameter settings. For example, when K=2 and P=4, SBF needs less than 0 operations within each iteration. For buffering and FPbuffering, the processing time is the same; it depends on 2 processes: element searching and element evicting. Searching can be quite expensive without an index structure. Both our experiments and those of Broder et al. [8] used a hash table to accelerate the search process. The extra space that a hash table needs to keep the search time constant is linear in the number of elements stored. Maintaining the LRU replacement policy (finding the least recently used element) is also costly, and extra space is needed to make it faster. This extra space can be quite large for LRU. However, this cost can be reduced to 2 bits per element by using the Clock approximation of LRU [8]. Therefore, buffering and FPbuffering need extra space linear in the number of buffer entries to reach a similar O(1) processing time. But in our error rate comparison, we did not count this extra space for buffering and FPbuffering.

5.6 Methods Comparison Summary

We compared 4 methods in this section: SBF, BF, FPbuffering and buffering. Among them, BF and buffering have only FPs and FNs respectively, while SBF and FPbuffering have errors of both kinds. BF is a space-efficient data structure which has been studied in the past and is widely used. It is good for representing a static set of data, provided that the number of distinct elements is known. However, in data stream environments the data is not static and keeps changing, and it is usually hard to know the number of distinct elements in advance. Moreover, BF is not applicable in cases where dynamic updates are needed, since elements can only be inserted into a BF and cannot be dropped. Consequently, BF is not suitable for many data stream applications.
Another variation of BF, the counting BF, allows deletions, but it is still not applicable in the scenario we consider; see the discussion in the first paragraph of the related work section. SBF, buffering and FPbuffering can all be applied to data stream scenarios. SBF is better in terms of accuracy and time when a certain amount of FPs is acceptable and the space is relatively small, which is the case in many data stream applications due to the real-time constraint. When the space is relatively large or only tiny FP rates are allowed, buffering is better.

6. RELATED WORK

The recent work of Metwally et al. [24] also studies the duplicate detection problem in a streaming environment based on Bloom filters (BF) [7]. They consider different window models: landmark windows, sliding windows and jumping windows. For the landmark window model, which is the scenario we consider, they apply the original Bloom filter without variations to detect duplicates, and thus do not consider the case where the BF becomes full. For the sliding window model, they use counting BFs [18] (changing bits into counters) to allow old information to be removed from the Bloom filter. However, this can be done only when the element to be removed is known, which is not possible in many streaming cases. For example, if the oldest element needs to be removed, one has to know which counters are touched by the oldest element, but this information cannot be found in counting BFs, and maintaining this knowledge can be quite expensive. For the jumping window model, they cut a large jumping window into multiple sub-windows, and represent both the jumping window and the sub-windows with counting BFs of the same size. Thus, the jumping window can jump forward by adding and removing sub-window BFs.
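To make the limitation of counting BFs in the sliding window model concrete, here is a small Python sketch of a counting Bloom filter. It is a generic illustration, not the structure used by Metwally et al.: the class name, the SHA-256 double hashing, and the unbounded integer counters are all assumptions. The point to notice is that `remove` takes the element itself as an argument, because the counters alone do not record which element touched them.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose bits are replaced by counters, so items can be removed."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indexes(self, item):
        # Assumption: derive K indexes by double hashing a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        # The item must be known: its counters are recomputed from it.
        for i in self._indexes(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

Expiring the oldest element of a sliding window would therefore require keeping the window's elements (or their counter positions) in a separate queue, which is the extra cost the paragraph above refers to.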
Another solution for detecting duplicates in a streaming environment is the buffering or caching method, which has been studied in many areas such as database systems, computer architecture, operating systems, and more recently URL caching in Web crawling [8]. We compare our method with that of Broder et al. [8] in the experiments.

The problem of exact duplicate elimination is well studied, and there are many efficient algorithms (e.g., see [20] for details and references). For the problem of approximate membership testing in a non-streaming environment, the Bloom filter has been frequently used and occasionally extended [18, 25]. Cohen and Matias [14] extend the Bloom filter to answer multiplicity queries. Counting distinct elements using the Bloom filter is proposed by Whang et al. [31]. Another branch of duplicate detection focuses on fuzzy duplicates [1, 11, 6, 30], where the distinction between elements is not straightforward to see.

A problem related to duplicate detection is counting the number of distinct elements. Flajolet and Martin [19] propose a bitmap sketch to address this problem in a large data set. The same problem is also studied by Cormode et al. [15], who introduce a sketch based on stable random variables.

Besides, the sticky sampling algorithm of Manku and Motwani [23] also randomly increments and decrements counters storing the frequencies of stream elements, but the decrement frequency varies and decrements are not performed for each incoming element. Their goal is to find the frequent items in a data stream.

As for data stream systems [3, 9, 29, 10, 16, 12], as far as we know, most of them divide the potentially unbounded data stream into windows of limited size and solve the problem precisely within the window. For example, Tucker et al. introduce punctuations into data streams, so that duplicate elimination can be implemented within data stream windows using traditional methods [29].
Since there is no way to store the entire history of an infinite data stream using limited space, our SBF essentially represents the most recent information by continuously discarding stale information. This is useful in many scenarios where recent data is more important and this importance decays over time. A number of such applications are provided in [3] and [26]. Our motivating example of Web crawling also has this property, since it may not matter that much to redundantly fetch a Web page that was crawled a long time ago, compared to fetching a page that was crawled more recently.

7. CONCLUSIONS

In this paper, we propose the SBF method to approximately detect duplicates for streaming data. When a certain false positive rate is allowed, SBF is superior in terms of both accuracy and time for a fixed amount of space compared to the alternative methods. Extending our work to handle sliding window queries is a future direction.

Acknowledgement

This work is supported by the Natural Sciences and Engineering Research Council of Canada. We would like to thank the anonymous reviewers for their comments and Ahmed Metwally for his feedback.

8. REFERENCES

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. of VLDB.
[2] Internet Archive.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of PODS.
[4] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In Proc. of ICDE.
[5] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to CAMs? In Proc. of INFOCOM.
[6] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proc. of KDD.
[7] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. In CACM, 1970.
[8] A. Z. Broder, M. Najork, and J. L. Wiener. Efficient URL caching for world wide web crawling. In Proc. of WWW.
[9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. B. Zdonik. Monitoring streams - a new class of data management applications. In Proc. of VLDB.
[10] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proc. of CIDR.
[11] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. of ICDE.
[12] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proc. of SIGMOD.
[13] E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of PODS.
[14] S. Cohen and Y. Matias. Spectral bloom filters. In Proc. of SIGMOD.
[15] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 15(3).
[16] C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: A stream database for network applications. In Proc. of SIGMOD.
[17] C. Estan and G. Varghese. Data streaming in computer networks. In Proc. of Workshop on Management and Processing of Data Streams (MPDS) in cooperation with SIGMOD/PODS.
[18] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3):281-293.
[19] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182-209, 1985.
[20] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice Hall.
[21] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4), 1999.
[22] Cisco Systems Inc. Cisco network accounting services. prodlit/nwact_wp.pdf.
[23] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB.
[24] A. Metwally, D. Agrawal, and A. E. Abbadi. Duplicate detection in click streams. In Proc. of WWW.
[25] M. Mitzenmacher. Compressed bloom filters. IEEE/ACM Trans. Netw., 10(5):604-612.
[26] T. Palpanas, M. Vlachos, E. J. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In Proc. of ICDE.
[27] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[28] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proc. of VLDB.
[29] P. A. Tucker, D. Maier, and T. Sheard. Applying punctuation schemes to queries over continuous data streams. IEEE Data Eng. Bull., 26(1):33-40.
[30] M. Weis and F. Naumann. Dogmatix tracks down duplicates in XML. In Proc. of SIGMOD, June.
[31] K. Whang, B. T. V. Zenden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2), 1990.


### Data Streams A Tutorial

Data Streams A Tutorial Nicole Schweikardt Goethe-Universität Frankfurt am Main DEIS 10: GI-Dagstuhl Seminar on Data Exchange, Integration, and Streams Schloss Dagstuhl, November 8, 2010 Data Streams Situation:

### Double Sequences and Double Series

Double Sequences and Double Series Eissa D. Habil Islamic University of Gaza P.O. Box 108, Gaza, Palestine E-mail: habil@iugaza.edu Abstract This research considers two traditional important questions,

### Load Shedding for Aggregation Queries over Data Streams

Load Shedding for Aggregation Queries over Data Streams Brian Babcock Mayur Datar Rajeev Motwani Department of Computer Science Stanford University, Stanford, CA 94305 {babcock, datar, rajeev}@cs.stanford.edu

### Time-Series Prediction with Applications to Traffic and Moving Objects Databases

Time-Series Prediction with Applications to Traffic and Moving Objects Databases Bo Xu Department of Computer Science University of Illinois at Chicago Chicago, IL 60607, USA boxu@cs.uic.edu Ouri Wolfson

### Efficient Large Flow Detection over Arbitrary Windows: An Algorithm Exact Outside An Ambiguity Region

Efficient Large Flow Detection over Arbitrary Windows: An Algorithm Exact Outside An Ambiguity Region Hao Wu University of Illinois at Urbana-Champaign Hsu-Chun Hsiao Carnegie Mellon University National

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### On the effect of forwarding table size on SDN network utilization

IBM Haifa Research Lab On the effect of forwarding table size on SDN network utilization Rami Cohen IBM Haifa Research Lab Liane Lewin Eytan Yahoo Research, Haifa Seffi Naor CS Technion, Israel Danny Raz

### ECEN 5682 Theory and Practice of Error Control Codes

ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.

### Signature Verification Why xyzmo offers the leading solution.

Dynamic (Biometric) Signature Verification The signature is the last remnant of the hand-written document in a digital world, and is considered an acceptable and trustworthy means of authenticating all

### Software-Defined Traffic Measurement with OpenSketch

Software-Defined Traffic Measurement with OpenSketch Lavanya Jose Stanford University Joint work with Minlan Yu and Rui Miao at USC 1 1 Management is Control + Measurement control - Access Control - Routing

### DETERMINANTS. b 2. x 2

DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in

### Quality of Service versus Fairness. Inelastic Applications. QoS Analogy: Surface Mail. How to Provide QoS?

18-345: Introduction to Telecommunication Networks Lectures 20: Quality of Service Peter Steenkiste Spring 2015 www.cs.cmu.edu/~prs/nets-ece Overview What is QoS? Queuing discipline and scheduling Traffic

### Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

### Master s Thesis. A Study on Active Queue Management Mechanisms for. Internet Routers: Design, Performance Analysis, and.

Master s Thesis Title A Study on Active Queue Management Mechanisms for Internet Routers: Design, Performance Analysis, and Parameter Tuning Supervisor Prof. Masayuki Murata Author Tomoya Eguchi February

### The Classical Architecture. Storage 1 / 36

1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

### CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

### 1 if 1 x 0 1 if 0 x 1

Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

### Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### It is time to prove some theorems. There are various strategies for doing

CHAPTER 4 Direct Proof It is time to prove some theorems. There are various strategies for doing this; we now examine the most straightforward approach, a technique called direct proof. As we begin, it

### Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### TCP in Wireless Mobile Networks

TCP in Wireless Mobile Networks 1 Outline Introduction to transport layer Introduction to TCP (Internet) congestion control Congestion control in wireless networks 2 Transport Layer v.s. Network Layer

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### Data Deduplication in Slovak Corpora

Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

### 18-731 Midterm. Name: Andrew user id:

18-731 Midterm 6 March 2008 Name: Andrew user id: Scores: Problem 0 (10 points): Problem 1 (10 points): Problem 2 (15 points): Problem 3 (10 points): Problem 4 (20 points): Problem 5 (10 points): Problem

### Math 317 HW #5 Solutions

Math 317 HW #5 Solutions 1. Exercise 2.4.2. (a) Prove that the sequence defined by x 1 = 3 and converges. x n+1 = 1 4 x n Proof. I intend to use the Monotone Convergence Theorem, so my goal is to show

### 2 Primality and Compositeness Tests

Int. J. Contemp. Math. Sciences, Vol. 3, 2008, no. 33, 1635-1642 On Factoring R. A. Mollin Department of Mathematics and Statistics University of Calgary, Calgary, Alberta, Canada, T2N 1N4 http://www.math.ucalgary.ca/

### Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology

Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

### Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

### A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

### Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania

Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically

### Cloud Computing. Computational Tasks Have value for task completion Require resources (Cores, Memory, Bandwidth) Compete for resources

Peter Key, Cloud Computing Computational Tasks Have value for task completion Require resources (Cores, Memory, Bandwidth) Compete for resources How much is a task or resource worth Can we use to price