Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters

Size: px
Start display at page:

Download "Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters"

Transcription

1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Fan Deng University of Alberta Davood Rafiei University of Alberta ABSTRACT Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we target at approximately eliminating duplicates in streaming environments given a limited space. Based on a well-known bitmap sketch, we introduce a data structure, Stable Bloom Filter, and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, SBF continuously evicts the stale information so that SBF has room for those more recent elements. After finding some properties of SBF analytically, we show that a tight upper bound of false positive rates is guaranteed. In our empirical study, we compare SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given.. INTRODUCTION Eliminating duplicates is an important operation in traditional query processing, and many algorithms have been developed [20]. A common characteristic of these algorithms is the underlying assumption that the whole data set is stored and can be accessed if needed. Thus, multiple passes over data are possible, which is the case in a traditional database scenario. However, this assumption does not hold in a new class of applications, often referred to as data stream systems [3], which are becoming increasingly important. Consequently, detecting duplicates precisely is not always possible. Instead, it may suffice to identify duplicates with some errors. While it is useful to have duplicate elimination in a Data Stream Management System (DSMS)[3], some new properties of these systems make the duplicate detection problem more challenging and to some degree different from the one in a traditional DBMS. First, the timely response property of data stream applications requires the system to respond Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2006, June 27 29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM /06/ $5.00. in real-time. There is no choice but to store the data in limited main memory rather than in huge secondary storage. Sometimes even main memory is not fast enough. For example, for network traffic measurement and accounting, ordinary memory (DRAM) is too slow to process each IP packet in time, and fast memory (on-chip SRAM) is small and expensive [7, 5]. Second, the potentially unbounded property of data streams indicates that it is not possible to store the whole stream in a limited space. As a result, exact duplicate detection is infeasible in such data stream applications. On the other hand, there are cases where efficiency is more important than accuracy, and therefore a quick answer with an allowable error rate is better than a precise one that is slow. Sometimes there is no way to have a precise answer at all. Therefore, load shedding is an important topic in data stream system research [28, 4]. Next, we provide some motivating examples.. Motivating Examples URL Crawling. Search engines regularly crawl the Web to enlarge their collections of Web pages. Given the URL of a page, which is often extracted from the content of a crawled page, a search engine must probe its archive to find out if the URL is in the engine collection and if the fetching of the URL can be avoided [8, 2]. One way to solve the problem is to store all crawled URLs in main memory and search for a newly encountered URL in it. However, the set of URLs can be too large to fit in memory. Partially storing URLs in the secondary storage is also not viable because of the large volume of searches that is expected to be performed within limited time. In practice, detecting duplicates precisely may not be indispensable. The consequence of an imprecise duplicate detection is that some already-crawled pages will be crawled again, or some new URLs which should be crawled are missed. The first kind of errors may lead the crawler to do some redundant crawling. This may not have a great influence on performance as long as the error rate is acceptable. For the second kind of errors, since a search engine can archive only a small portion of the entire web, a small miss rate is usually acceptable. In addition, if a missed URL refers to a high quality page, it is quite likely that the URL will be listed in the content of more than one crawled page, and there is

2 less chance of repeatedly missing it. The solution adopted by the Internet Archive crawler introduces the second kind of errors [2]. Selecting distinct IP addresses. In network monitoring and accounting, it is often important to understand the traffic and to identify the users on the network [22]. The following two queries, for example, may be interesting to network monitors: who are the users on the network within the past hour? Where do they go? The queries may be written as: Select distinct Source/Destination IP from IP PacketsStream Within Past Hour The result could be helpful for further analyzing the user profiles, interests and the network traffic. Because of the current high throughput of Internet routers and limited amount of fast memory, it is hard to capture per package information precisely. Often, sampling or buffering is used as a compromise, which is discussed in this paper. Duplicate detection in click streams. Recently, Metwally et al. propose another application for approximate duplicate detection in a streaming environment [24]. In a Web advertising scenario, advertisers pay web site publishers for clicks on their advertisements. For the sake of profit, it is possible that a publisher fakes some clicks (using scripts), hence a third party, the advertising commissioner, has to detect those false clicks by monitoring duplicate user IDs. We discuss more about their work in the related work section..2 Our Contributions In this paper, we propose Stable Bloom Filter(SBF), which extends and generalizes the regular Bloom filter, and accordingly, a novel algorithm which dynamically updates the sketch to represent recent data. We find and prove the stable properties of an SBF including stability, exponential convergence rate and monotonicity, based on which we show that using constant space, the chance of a false positive can be bounded to a constant independent of the stream size, and this constant is explicitly derived. Furthermore, we show that the processing time of SBF for each element in the stream is also constant independent of the stream size. To make our algorithm readily applicable in practice, we provide detailed discussions of the parameter setting issues both in theory and in experiments. And we compare our method to alternative methods using both real and synthetic data. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given..3 Roadmap In Section 2, we present the problem statement and some background on existing approaches. Our solution is presented and discussed in Section 3. In Section 4, we discuss how our algorithm can be used in practice. In Section 5, we verify our theoretical findings, experimentally evaluate our method and report the results of comparisons with alternative methods. Related work is reviewed in Section 6, and conclusions are given in Section PRELIMINARIES This section presents the problem statement in a data stream model and some possible solutions. 2. Problem Statement We consider a data stream as a sequence of numbers, denoted by S N = x,..., x i,..., x N, where N is the size of the stream. The value of N can be infinite, which means that the stream is not bounded. In general, a stream can be a sequence of records, but it is not hard to transform each record to a number (e.g., using hashing or fingerprinting) and use this stream model. Our problem can be stated as follows: given a data stream S N and a certain amount of space, M, estimate whether each element x i in S N appears in x,..., x i or not. Since our assumption is that M is not large enough to store all distinct elements in x,..., x i, there is no way to solve the problem precisely. Our goal is to approximate the answer and minimize the number of errors, including both false positives and false negatives, where a false positive is a distinct element wrongly reported as duplicate, and a false negative is a duplicate element wrongly reported as distinct. To address this problem, we examine two techniques that have been previously used in different contexts. 2.2 The Buffering Method A straightforward solution is to allocate a buffer and fill the buffer with enough elements of the stream. For each new element, the buffer can be checked, and the element may be identified as distinct if it is not found in the buffer, and duplicate otherwise. When the buffer is full, a newly arrived element may evict another element out of the buffer before it is stored. There are many replacement policies for choosing an element to be dropped out (e.g. [20]). Clearly, buffering introduces no false positives. We will use this method in our experiments and compare its performance to that of our method. 2.3 Bloom Filters Bloom [7] proposes a synopsis structure, known as the Bloom filter, to approximately answer membership queries. A Bloom filter, BF, is a bit array of size m, all of which are initially set to 0. For each element, K bits in BF are set to by a set of hash functions {h (x),..., h K(x)}, all of which are assumed to be uniformly independent. It is possible that one bit in BF is set multiple times, while only the first setting operation changes 0 into, and the rest have no effect on that bit. To know whether a newly arrived element x i has been seen before, we can check the bits {h (x i),..., h K(x i)}. If any one of these bits is zero, with 00% confidence we know x i is a distinct element. Otherwise, it is regarded as a duplicate with a certain probability of error. An error may occur because it is likely that the cells {h (x i),..., h K(x i)} are set before by elements other than x i. The probability of a false positive (false positive rate) F P = ( p) K, where p = ( /m) Kn is the probability that a particular cell is still zero after seeing n distinct elements. It is shown that when the number of hash functions K = ln(2)(m/n), this probability will be minimized to (/2) ln(2)(m/n),

3 where m is the number of bits in BF and n is the number of distinct elements seen so far [25]. 3. STABLE BLOOM FILTERS The Bloom filter is shown to be useful for representing the presence of a set of elements and answering membership queries, provided that a proper amount of space is allocated according to the number of distinct elements in the set. 3. The Challenge to Bloom Filters However, in many data stream applications, the allocated space is rather small compared to the size of the stream. When more and more elements arrive, the fraction of zeros in the Bloom filter will decrease continuously, and the false positive rate will increase accordingly, finally reaching the limit,, where every distinct element will be reported as a duplicate, indicating that the Bloom filter is useless. Our general solution is to avoid the state where the Bloom filter is full by evicting some information from it before the error rate reaches a predefined threshold. This is similar to the replacement operation in the buffering method, in which there are several possible policies for choosing a past element to drop. In many real world data stream applications, often the recent data is more important than the older data [3, 26]. However, for the regular Bloom filter, there is no way to distinguish the recent elements from the past ones, since no time information is kept. Accordingly, we add a random deletion operation into the Bloom filter so that it does not exceed its capacity in a data stream scenario. 3.2 Our Approach To solve this problem, we introduce the Stable Bloom Filter, an extension of the regular Bloom filter. Definition (Stable Bloom Filter (SBF)). An SBF is defined as an array of integer SBF [],..., SBF [m] whose minimum value is 0 and maximum value is Max. The update process follows Algorithm. Each element of the array is allocated d bits; the relation between Max and d is then Max = 2 d. Compared to bits in a regular Bloom filter, each element of the SBF is called a cell. Concretely speaking, we change bits in the regular Bloom filter into cells, each consisting of one or more bits. The initial value of the cells is still zero. Each newly arrived element in the stream is mapped to K cells by some uniform and independent hash functions. As in a regular Bloom filter, we can check if a new element is duplicate or not by probing whether all the cells the element is hashed to are non-zero. This is the duplicate detection process. After that, we need to update the SBF. We first randomly decrement P cells by so as to make room for fresh elements; we then set the same K cells as in the detection process to Max. Our symbol list is shown in Table, and the detailed algorithm is described in Algorithm. 3.3 The Stable Property Based on the algorithm, we find an important property of SBF both in theory and in experiments: after a number of Algorithm : Approximately Detect Duplicates using SBF Data: A sequence of numbers S = x,..., x i,..., x N. Result: A sequence of Yes/No corresponding to each input number. begin initialize SBF[]... SBF[m] = 0 for each x i S do Probe K cells SBF [h (x i)]... SBF [h K(x i)] if none of the above K cells is 0 then DuplicateFlag = Yes else DuplicateFlag = No Select P different cells uniformly at random SBF [j ]... SBF [j P ], P {,..., m} for each cell SBF [j] {SBF [j ],..., SBF [j P ]} do if SBF [j] then SBF [j] = SBF [j] end for each cell {SBF [h (x i)],..., SBF [h K(x i)} do SBF [h(x i)] = Max Output DuplicateFlag Table : The Symbol List Symbols Meanings N Number of elements in the input stream M Total space available in bits m Number of cells in the SBF Max The value a cell is set to d Number of bits allocated per cell K Number of hash functions k The probability that a cell is set in each iteration P Number of cells we pick to decrement by in each iteration p The probability that a cell is picked to be decremented by in each iteration The ith hash function h i iterations, the fraction of zeros in the SBF will become fixed no matter what parameters we set at the beginning. We call this the stable property of SBF and deem it important to our problem because the false positive rate is dependent on the fraction of zeros in SBF. Theorem. Given an SBF with m cells, if in each iteration, a cell is decremented by with a probability p and set to Max with a probability k, the probability that the cells becomes zero after N iterations is a constant, provided that N is large enough, i.e. lim P r(sbfn = 0) N exists, where SBF N is the value of the cell at the end of iteration N. In our formal discussion, we assume that the underlying distribution of the input data does not change over time.

4 Our experiments on the real world data show that this is not a very strong assumption, and the experimental results verify our theory. Proof. Within each iteration, there are three operations: detecting duplicates, decreasing cell values and setting cells to Max. Since the first operation does not change the values of the cells, we just focus on the other two operations. Within the process of iterations from to N, the cell could be set 0,..., (N ) or even N times. Since the newest setting operation clears the impact of any previous operations, we can just focus on the process after the newest setting operation. Let A l denote the event that within the N iterations the most recent setting operation applied to the cell occurs at iteration N l, which means that no setting happened within the most recent l iterations (i.e. from iteration N l + to iteration N, l < N), and let ĀN denotes the event that the cell has never been set within the whole N iterations. Hence, the probability that the cell is zero after N iterations is as follows: N P r(sbf N = 0) = [P r(sbf N = 0 A l )P r(a l )] () l=max + P r(sbf N = 0 ĀN )P r(ān ), where P r(sbf N = 0 A l ) = l j=max ( ) l j p j ( p) l j (2) P r(a l ) = ( k) l k (3) P r(sbf N = 0 ĀN ) = (4) P r(ān ) = ( k)n. (5) We have Eq. 2 because during those l iterations, there is no setting operation, and the cell becomes zero if and only if it is decremented by no less than Max times. Clearly when l < Max, the cell is impossible to be decreased to 0, and l = N means ĀN happens, so we just consider the cases of Max l (N ) in Eq. 2. When ĀN happens, the cell is 0 with a probability because the initial value of the cell is 0 and it has never been set, therefore we have Eq. 4. Having the above equations, we can prove that lim N P r(sbf N = 0) exists. Due to the page limit, we put the detailed proof in an extended version of this paper. Having Theorem, now we can prove our stable property statement. Corollary (Stable Property). The expected fraction of zeros in an SBF after N iterations is a constant, provided that N is large enough. to that cell. Since the underlying distribution of the input data does not change, the probability that a particular element appears in each iteration is fixed. Therefore, the probability of each cell being set is fixed. Meanwhile, the probability that an arbitrary cell is decremented by is also a constant. According to Theorem, the probabilities of all cells in the SBF becoming 0 after N iterations are constants, provided that N is large enough. Therefore, the expected fraction of 0 in an SBF after N iterations is a constant, provided that N is large enough. Now we know an SBF converges. In fact this convergence is like the process that a buffer is filled by items continually. SBF is stable means that its maximum capacity is reached, similar to the case that a buffer is full of items. Another important property is the convergence rate. Corollary 2 (Convergence Rate). The expected fraction of 0s in the SBF converges at an exponential rate. Proof. From Eq., 4 and 5, we can derive P r(sbf N = 0) P r(sbf N = 0) =P r(sbf N = 0 A N )P r(a N ) + P r(ān ) P r(ān ) =k( k) N (P r(sbf N = 0 A N ) ) Clearly, Eq. 6 exponentially converges to 0. i.e. P r(sbf N [c]= 0) converges at an exponential rate, and this is true for all cells in the SBF. Therefore, the expected fraction of 0s in the SBF converges at an exponential rate. Lemma (Monotonicity). The expected fraction of 0s in an SBF is monotonically non-increasing. Proof. Since the value of Eq. 6 is always no greater than 0, the probability that a cell becomes zero is always decreasing or remains the same. Similar to the proof of Corollary 2, we can draw the conclusion. This lemma will be used to prove our general upper bound of the false positive rate where the number of iterations needs not to be infinity. 3.4 The Stable Point Currently we know the fraction of 0s in an SBF will be a constant at some point, but we do not know the value of this constant. We call this constant the stable point. Definition 2 (Stable Point). The stable point is defined as the limit of the expected fraction of 0s in an SBF when the number of iterations goes to infinity. When this limit is reached, we call SBF stable. (6) Proof. In each iteration, each cell of the SBF has a certain probability of being set to Max by the element hashed From Eq., we are unable to obtain the limit directly. However, we can derive it indirectly.

5 Theorem 2. Given an SBF with m cells, if a cell is decremented by with a constant probability p and set to Max with a constant probability k in each iteration, and if the probability that the cell becomes 0 at the end of iteration N is denoted by P r(sbf N = 0), lim P N r(sbfn = 0) = ( + p(/k ) ) Max (7) Proof. The basic idea is to make use of the fact that SBF is stable, the expected fraction of 0,,...,Max in SBF should be all constant. See the full paper for the details. The theorem can be verified by replacing the parameters in Eq. with some testing values. From Theorem 2 we know the probability that a cell becomes 0 when SBF is stable. If all cells have the same probability of being set, we can obtain the stable point easily. However, that requires the data stream to be uniformly distributed. Without this uniform distribution assumption, we have the following statement. Theorem 3 (SBF Stable Point). When an SBF is stable, the expected fraction of 0s in the SBF is no less than ( + P (/K /m) ) Max, where K is the number of cells being set to Max and P is the number of cells decremented by within each iteration. Proof. The basic idea is to prove the case of m = 2 first, and generalize it to m 2. See the full paper for the details. 3.5 False Positive Rates In our method, there could be two kinds of errors: false positives (FP) and false negatives (FN). A false positive happens when a distinct element is wrongly reported as duplicate; a false negative happens when a duplicate element is wrongly reported as distinct. We call their probabilities false positive rates and false negative rates. Corollary 3 (FP Bound when Stable). When an SBF is stable, the FP rate is a constant no greater than F P S, F P S = ( ( + P (/K /m) ) Max ) K. (8) Proof. If P r j(0) denotes the probability that the cell SBF [j] = 0 when the SBF is stable, the FP rate is ( m ( P r(0)) + + ( P rm(0))k m = ( (P r(0) + + P rm(0)))k m Please note that (P r(0) + + P rm(0)) is the expected m fraction of 0s in the SBF. According to Theorems and 3, the FP rate is a constant and Eq. 8 is an upper bound of the FP rate. This upper bound can be reached when the stream elements are uniformly distributed. Corollary 4 (The case of reaching the FP Bound). Given an SBF with m cells, if the stream elements are uniformly distributed when the SBF is stable, the FP rate is F P S (Eq. 8). Proof. Because elements in the input data stream are uniformly distributed, each cell in the SBF will have the same probability to be set to Max. According to Theorem and the proof of Theorem 3 we can derive this statement. Corollary 5 (General FP Bound). Given an SBF with m cells, F P S (Eq. 8) is an upper bound for FP rates at all time points, i.e. before and after the SBF becomes stable. Proof. This can be easily derived from Lemma and Corollary 3. Therefore, the upper bound for FP rates is valid no matter the SBF is stable or not. From Eq. 8 we can see that m has little impact on F P S, since /m is negligible compared to /K (m K). This means the amount of space has little impact on the FP bound once the other parameters are fixed. The value of P has a direct impact on F P S: the larger the value of P, the smaller the value of F P S. This can be seen intuitively: the faster the cells are cleared, the more 0s the SBF has, thus the smaller the value of F P S is. Oppositely, increasing the value of Max results in the increase of F P S. In contrast to P and Max, from the formula we can see the impact of the value of K on F P S is twofold: intuitively, using more hash functions increases the distinguishing power for duplicates (decreases F P S), but fills the SBF faster (increases F P S). 3.6 False Negative Rates A false negative(fn) is an error when a duplicate element is wrongly reported as distinct. It is generated only by duplicate elements, and is related to the input data distribution, especially the distribution of gaps. A gap is the number of elements between a duplicate and its nearest predecessor. Suppose a duplicate element x i whose nearest predecessor is x i δi (x i = x i δi ) is hashed to K cells, SBF [C i]... SBF [C ik]. An FN happens if any of those K cells is decremented to 0 during the δ i iterations when x i arrives. Let P R0(δ i, k ij) be the probability that cell C ij (j =... K) is decremented to 0 within the δ i iterations. This probability can be computed as in Eq. : where P R0(δ i, k ij) = δ i l=max P r(sbf δi = 0 A l ) = [P r(sbf δi = 0 A l )P r(a l )] + P r(sbf δi = 0 Āδ i )P r(āδ i ), l j=max (9) ( ) l p j ( p) l j, (0) j

6 P r(sbf δi = 0 Āδ i ) = P r(a l ) = ( k ij) l k ij, () δ i j=max ( ) δ i p j ( p) δi j, (2) j P r(āδ i ) = ( k ij) δ i, (3) and k ij is the probability that cell C ij is set to Max in each iteration. The meanings of the other symbols are the same as those in the proof of Theorem. Also, most of above equations are similar, except that Eq. 2 is different from Eq. 4. This is because the initial value of the cell in the case of Theorem is 0, but it is Max here. Furthermore, The probability that an FN occurs when x i arrives can be expressed as follows: P r(f N i) = K ( P R0(δ i, k ij)). (4) j= When δ i < Max, P R0(δ i, k ij) is 0, which means the FN rate is 0. Besides, for distinct elements who have no predecessors, the FN rates are 0. The value of δ i depends on the input data stream. In the next section, we discuss how to adjust the parameters to minimize the FN rate under the condition that the FP rate is bounded within a user-specified threshold. 4. FROM THEORY TO PRACTICE In the previous section we propose the SBF method and analytically study some of its properties: stability, convergence rate, monotonicity, stable point, FP rates (upper bound) and FN rates. In this section, we discuss how SBF can be used in practice and how our analytical results can be applied. 4. Parameters Setting Since FP rates can be bounded regardless of the input data but FN rates cannot, given a fix amount of space, we can choose a combination of Max, K and P that minimizes the number of FNs under the condition that the FP rate is within a user-specified threshold. Meanwhile we take into account the time spent on each element, which is crucial in many data stream applications. Overview of parameters setting. We have 3 parameters to set in our algorithm: Max, K and P. They are related to other parameters: FP rates, FN rates, SBF size m. Among these parameters, we assume that users specify m and the allowable FP rate. Based on the analysis in the previous section, we can obtain a formula computing the value of P from other parameters provided that Max and K have been chosen already. To set Max and K properly, we first derive the relationship between FN rates (the expected number of FNs), other parameters and the input data distribution. We find that the optimal value of K which minimizes the FN rates is independent of the input data. Thus, K can be set by trying different possible values on the formula we derive without considering the input data distribution, assuming Max is known. Last, we show that Max can be set empirically. The expected number of FNs. Since our goal is to minimize the number of FNs, we can compute the expected number of FNs, E(#F N), as the sum of FN rates for each duplicate element in the stream: E(#F N) = Ñ i= P r(f Ni), where Ñ is the number of duplicates in the stream. Combining it with Eq. 4 we have Ñ K E(#F N) = [ ( P R0(δ i, k ij))], (5) i= j= where δ i is the number of elements between x i and its predecessor, and k ij is the probability that cell C ij is set to Max in each iteration. C ij is the cell element x i is hashed to by the jth hash function. Since the function P R0(δ, k) is continuous, for each x i there must be a k i such that ( P R0(δ i, k i)) K = K ( P R0(δ i, k ij)). j= For the same reason, there must be an average δ and an average k such that Ñ[ ( P R0( δ, k)) K ] = Ñ [ ( P R0(δ i, k i)) K ] = E(#F N). i= Let f( δ, k) be the average FN rate, i.e. f( δ, k) = ( P R0( δ, k)) K. (6) Our task then becomes setting the parameters to minimize this average FN rate, f( δ, k), while bounding the FP rate within an acceptable threshold. The setting of P. Suppose users specify a threshold F P S, indicating the acceptable FP rate. This threshold establishes a constraint between the parameters: Max, K, P, m and F P S according to Corollary 5. Thus, users can set P based on the other parameters: P = ( )(/K /m). (7) ( F P S /K ) /Max Since m is usually much larger than K, /m is negligible in the above equation, which means that the setting of P is dominated only by F P S, Max, K, and is independent of the amount of space. The setting of K. Since the FP constraint can be satisfied by properly choosing P, we can set K such that it minimizes the number of FNs. From the above discussions we know the relationship between the expected number of FNs and the probabilities that cells are set to Max. Next, we connect these probabilities with our parameters K, m and the input stream. Suppose there are N elements in the stream of which n are distinct, and the frequency for each distinct element x l is f l. Clearly n l= f l = N. Assuming that the hash functions are uniformly at random, for a cell that element x i is hashed to, the number of times the cell is set to Max after seeing all N elements is a random variable, f i + n l= f l I l, where f i is the frequency of x i in the stream, and each I l (l =... n ) is an independent random variable following the Bernoulli

7 Max=, FPS=0., m=0^5, delta=200, Phi=0 Max=, FPS=0., m=0^7, delta=200, fxn= Max=, FPS=0., m=0^7, phi=0, fxn= FN rate FN rate FN rate e K Legend fxn=0.5 fxn=0. fxn=0.0 fxn= K Legend phi= 0.0 phi=0.0 phi= phi=0 phi= K Legend delta=0 delta=50 delta=00 delta=200 delta=0^7 Figure : FN rates vs. K distribution, i.e. I l = {, P r(i l = ) = K, m 0, P r(i l = 0) = K. m Thus, k i = N fi + n N l= f l I l is also a random variable. For the K cells an element x i is hashed to, the probabilities that those cells are set to Max in each iteration can be considered as K trials of k i. Since the mean and the variance of each I l are µ Il = K and m σ2 I l = K ( K ) respectively, m m it is not hard to derive that the mean and variance of k i: µ ki = N fi + K n N m l= f l = N fi + K ( fi) and m N σ2 k i = K ( K ) n N 2 m m l= f l 2. Let φ i = ki µ k i = K ( K ) m m k i N fi K ( fi) m N (8) K ( K ) m m be a transformation on k i. Then φ i [ N f i+ K m ( N f i) Km ( K m ), N f i m K ( N f i) Km ] is a random variable whose mean and ( m K ) variance are: µ φi = 0 and σφ 2 i = n N 2 l= f l 2. Note that σφ 2 i n N 2 l= (f l f max) < n N 2 l= (f l f max) = f max, where N f max is the frequency of the most frequent element in the stream. Since the mean and the variance of the random variable φ i are independent of K and m, we may consider φ i independent of K and m in practice. In other words, φ i can be seen as a property of the input stream. Similar to k i we can obtain a φ i such that ( P R0(δ i, φ K i)) K = ( P R0(δ i, φ ij)) = P r(f N i) j= (9) where φ i [Min(φ ij), Max(φ ij)], and φ ij are K trials of φ i(j =... K). Since the standard deviation of φ i is very small compared to the range of its possible values, and φ i is considered independent of K and m, φi can be approximately considered independent of K and m as well. For example, when f max i, m = N N 0 6 K 06, the value range of φ i is approximately [0, 000], while σ φi 0.. To set K, keeping all other parameters fixed we vary the values of K and compute the FN rate based on Eq. 9, 9, 7 and 8. By trying different combinations of parameter settings (Max =, 3, 7, 5, F P S = 0.2, 0., 0.0, 0.00, m = 000, 0 5, 0 7, 0 9, δ i = 0, 00, 000, 0 5, 0 7, 0 9, fxn = f i N = 0.5, 0., 0.0, 0.000, and φ i = 0.00, 0.,, 0, 00, 000,...), we find that once the values of F P S and Max are fixed, the value of the optimal or near optimal K is independent of the values of δ i, f i/n, φ i and m. Observation. The value of the optimal or near optimal K is dominated by Max and F P S. The input data and the amount of the space have little impact on it. Furthermore, the value is small (<0 in all of our testing). For example, when F P S = 0.2 and Max =, the value of the optimal or near optimal K is always between and 2; when F P S = 0. and Max = 3, it is always between 2 and 3; when F P S = 0.0 and Max = 3, it is always between 4 and 5. Therefore, without considering the input data stream we can pre-compute the FN rates for different values of K based on Max and F P S and choose the optimal one. Our experimental results reported in the next section are consistent with this observation. Figure shows an example of how the FN rates change with different values of K under different parameter settings based on Eq. 9 and Eq. 9. From the figure we can see that in the case of Max = and F P S = 0., we can set K to 2 regardless of the input stream and the amount of space. Therefore, in practice we can set φ i, δ i and f i/n to some testing values (e.g. 0, 200, respectively) and find the optimal or near optimal K using the formulas. The setting of Max. Based on the above discussion, we can set K regardless of the input data, but to choose a proper value of Max, we need to consider the input. More specifically, to minimize the expected number of FNs, we need to know the distributions of gaps in the stream to try different possible values of Max on Eq. 6 and 9. Since the expected value of φ i is 0 and its standard deviation is very small compared to its value domain, we set φ to 0 in the formulas. To effectively use the space we only set Max to 2 d (d is the number of bits/cell), otherwise the bits allocated for each cells are just wasted. Furthermore, in terms of the time cost, Max should be set as small as possible, because the larger Max is set, the larger P will be (see Eq.7, assuming K is a constant). For example, when Max =, F P S = 0.0 and K = 3(the optimal K), the computed value of P is 0; while Max = 5, F P S = 0.0, and K = 6 (the optimal K), the value of P computed is 4 (the value of P is not

8 sensitive to m). In practice, we limit our choice of Max to, 3 and 7 (if higher time cost can be tolerated, larger values of Max can be tried similarly). To choose a Max from these candidates, we try each value on Eq. 6 and Eq. 9, and find the one minimizing the average FN rate. Figure 2 depicts the difference of average FN rates between Max = 3 and Max = based on Eq.6 and Eq.9. We set f i = 0 because we are considering the entire stream rather N than a particular element in this case. The figure shows that if the values of gaps( δ) are smaller than a threshold, Max = 3 is a better choice. When the gaps become larger, Max = is better. If the gaps are large enough, there is not much difference between the two options. The figure shows the cases under different settings of φ, space sizes and acceptable FP rates. FN rate difference e+06.2e+06 delta computed based on F P S, m, Max and K; then find the optimal values of K for each case of Max(, 3, 7) by trying limited number( 0) of values of K on the FN rate formulas; Last, we estimate the expected number of FNs for each candidate value of Max using its corresponding optimal K and some prior knowledge of the stream, and thus choose the optimal value of Max. In the case that no prior knowledge of the input data is available, we suggest setting Max =. The described parameter setting process can be implemented within a few lines of codes. 4.2 Time Complexity Since our goal is to minimize the error rates given a fixed amount of space and an acceptable FP rate, we do not discuss space complexity, and just focus on time complexity. There are several parameters to be set in our method: K, Max and P. Within each iteration, we firstly need to probe K cells to detect duplicates. After that we pick P cells and decrement from them. Last we set the same K cells as probed in the first step to Max. Therefore, the time cost of our algorithm for handling each element is dominated by K and P. Theorem 4 (Time Complexity). Given that K and Max are constants, processing each data stream element needs O() time, independent of the size of the space and the stream. 0.2 Legend Space=0^7, FPS=0., phi=0 Space=0^7, FPS=0., phi=0.00 Space=0^7, FPS=0., phi= 0.00 space=2*0^7, FPS=0., phi=0 Space=0^7, FPS=0.0, phi=0 Figure 2: FN rates difference between Max = 3 and Max = (Max3 Max) vs. gaps. K is set to the optimal value respectively under different settings. We also tested the FN rate difference between Max = 7 and Max = 3, and observe the same general rule: a larger value of Max is better for smaller gaps, and a smaller F P S suggests a larger setting of Max. Similarly, we find no exceptions under other combinations of settings: F P S = 0.2, 0., 0.0, 0.00 and m = 000, 0 5, 0 7, 0 9. Trying different value of Max on Eq. 6 and Eq. 9, we set φ to 0 and assume that the distribution of the gaps are known. If the assumption cannot be satisfied in practice, we suggest setting Max to, because this setting often benefits a larger range of gaps in the stream. And our experiments also show that in most cases setting Max to achieves better improvements in terms of error rates compared to the alternative method, LRU buffering. In fact, buffering performs well when gaps are small, which is similar to the cases that Max is larger. The behavior of our SBF becomes closer to the buffering method when the value of Max is set larger. Summary of parameters setting. In practice, given an F P S, the amount of available space and the gap distribution of the input data, to set the parameters properly, we first establish a constraint for P, which means P can be Proof. From Eq.7 we know the constraint among K, P, m, Max and F P S(the user-specified upper bound of false positive rates). If K, Max and F P S are constants, the relationship between P and m is inversely proportional, which means m has no impact on the processing time. Since Max, K and F P S are all constants, the time complexity is O(). Based on the discussion of parameter settings, we know that the selection of K is insensitive to m. Furthermore, the value of m and the stream size have little impact on the selection of Max based on our testing on Eq. 6. Therefore, our algorithm needs O() time per element, independent of the size of space. 5. EXPERIMENTS In this section, we first describe our data set and the implementation details of 4 methods: SBF, Bloom Filter(BF), Buffering and FPBuffering (a variation of buffering which can be fairly compared to SBF). We then report some of the results on real data sets. We also ran experiments on some synthetic data sets, but due to the page limit, we can not show the results here. Last, we summarize the comparison between different methods. 5. Data Sets Real World Data. We simulated a web crawling scenario[8] as discussed in Section, using a Web crawl data set obtained from the Internet archive[2]. We hashed each URL in this collection to a 64-bit fingerprint using Rabin s method [27], as was done earlier [8]. With this fingerprinting technique, there is a very small chance that two different

9 URLs are mapped to the same fingerprint. We verified the data set and did not find any collisions between the URLs. In the end, we obtained a 28GB data file that contained about 700 million fingerprints of links, representing a stream of URLs encountered in a Web crawling process. 5.2 Implementation Issues SBF Implementation. Our algorithm is simple and straightforward to implement: ) hash each incoming stream element into K numbers, and check the corresponding K cells; 2) generate a random number, decrement the corresponding cell and (P-) cells adjacent to it by ; this process is faster than generating P random numbers for each element; although the processes of picking the P cells are not independent, each cell has a probability of P/m for being picked at each iteration. Our analysis still holds. 3) Set those K cells checked in step to Max. One issue we have to deal with is setting the parameters Max, K and P. According to the discussions of parameters setting in the previous section, we can set Max, K and P for a given F P S without considering the input data sets. For example, for FPS=0%, we set Max=, K=2 and P=4, which worked well for different data sets in our experiments. To evaluate our work, we implemented 3 alternative methods: Bloom Filters(BF), buffering and FPBuffering. Bloom Filters Implementation. In our implementation, BF becomes a special case of SBF where Max= and P=0. Knowing the number of distinct elements and the amount of space, we can compute the optimal K (see the discussion in Section 2.3). Buffering Implementation. Implementing buffering needs more work. First, to detect duplicates we need to search the buffer. To speed up the searching process, we used a hash table, as was done by Broder et al. [8]. Second, when the buffer is full, we have to choose a policy to evict an old element and make room for the newly coming one. Broder et al. [8] compared 5 replacement policies for caching Web crawls. They showed that LRU and Clock, the latter of which is used as an approximation of LRU, were the best practical choices for the URL data set (there were some ideal but impractical ones as well); in terms of miss rate (FN rate in our case), there was almost no difference between these two though. We chose LRU in our experiments. Both LRU and clock need a separate data structure for buffering elements, so that we can choose one for eviction [8]. For simplicity of the implementation, we used a double linked list, while Broder et al. chose a heap. This difference should not affect our experimental results since our error rate comparison did not account for the extra space we used in buffering. FPbuffering Implementation. To fairly and effectively compare our method to buffering method, we introduced a variation of buffering called FPbuffering. There are two reasons for this. First, SBF has both FPs and FNs while buffering has only FNs. In different applications the importance of FPs and FNs may be different. So it is hard to compare SBF to buffering directly. Second, the fraction of duplicates in the data stream is a dominant factor affecting the error rates, because FNs are only generated by duplicates and FPs by distincts. For buffering, a data stream full of duplicates will cause many FNs, while a stream consisting of all distincts cause no errors at all. FPbuffering works as follows: when a new data stream element arrives, we search it in the buffer. If found, report duplicate as in the original buffering; if not found, we report it as a duplicate with a probability q, and as a distinct with probability ( q). In the original buffering, if an element is not found in the buffer, it is always reported as a distinct. This variation can increase the overall error rates of buffering when there are more distincts in the stream, but can decrease the error rates when there are more duplicates in the stream. Clearly, FPbuffering has both FPs and FNs. In fact, q is the FP rate since a distinct element will be reported as duplicate with a probability q. By setting a common FP rate with SBF, we can fairly compare their FN rates, and this comparison will not be affected by the fraction of duplicates in the stream. In our experiments, we assumed that buffering and FPbuffering required 64 bits per URL fingerprint on the Web data (same as [8]). and 32 bits per element on the synthetic data simulating the size of an IP address. In other words, each element occupies 64 bits for the real data. and 32 bits for the synthetic data. 5.3 Theory Verification In an experiment to verify some of our theoretical results, we tested the stable properties of our SBF and the convergence rate. The results are shown in Figures 3. From the graph we can see that the fraction of zeros in the SBF decreases until it becomes stable. When the allocated space is small, the convergence rate is higher. This is because when the space is larger, the probability a cell being set is smaller. From Corollary 2 and Eq. 6 we know the convergence rate should be lower in this case. Also, we can see that when the SBF is stable, the fraction of zeros is still fluctuating slightly. This can be caused by the input data stream whose underlying distributions is varying. Furthermore, the fraction of 0s keeps decreasing in general before being stable; at this point, the FP rate should reach its maximum, and our theoretical upper bound for FP rates is also valid before the SBF become stable in this case. Our next experiments show the effectiveness of our theoretical FP bound. When the space is relatively small, the real FP rate is close to the bound. 5.4 Error Rates Comparison This experiment compared the error rates between SBF, FPbuffering, buffering and BF on the real data by varying the size of the space. The real data set contained 694,984,445 URL fingerprints, of which 4.75% were distinct. To do the comparison under different fractions of distinct elements, we built two more real data sets by using the first 00,000 and 0 million elements of the original data file. The fractions of distinct elements for these two data set respectively were 75.66% and 48.5%. For SBF, we set the acceptable FP rate (number of FPs/number of distincts), FPS, to 0%, and Max, K,P to, 2, 4 respectively. The results under different FPS settings will be shown in the next experiment. For FPbuffering, we set the FP rates to the same number as SBF so that both generated exactly the same number of FPs, and we can just compare their FN rates. Please note that buffering and BF only generate FNs and FPs respec-

10 be decreased by around 0-20%, which means the improvement from SBF may be equivalent of that from doubling the amount of space. The FP rates of BF is much higher than the acceptable FP rates in the first 2-3 rows of each table. Since buffering only generates FNs, it is not comparable to SBF here. But we can see that the FN rates of FPbuffering also decrease by introducing FPs into it. Figure 3: Fraction of zeros changed with time on the whole real data set (Max=, K=2, P=4, FPS=0%), space unit=64bits tively, and FPbuffering reduces the FN rates of buffering substantially in most cases by introducing a certain amount of FPs. However, we also notice that when the space is relatively large (the last row of each table), SBF performs not as good as buffering and BF. This is because when the space is large, BF might be able to hold all the distincts and keep a reasonable FP rates. We can directly compute the amount of space required based on the FP rates desired and the number of distincts in the data set according to the formula in Section 2.3. In this case, there is no need to evict information out of the BF, which means SBF is not applicable. If we can afford even more space, which is large enough to hold all the distincts in the data set using a buffer, there will be no errors at all. The last row of the second table shows this scenario. But in many data stream applications, a fast storage is needed to satisfy real time constraints and the size of this storage is typically less than the universe of the stream elements as discussed in Introduction. Another fact is that in both SBF and buffering, we can refresh the storage and bias it towards recent data; they both evict stale information continuously and keep those fresh elements. While BF is not applicable in this case since BF can be only used to represent a static data set. Thus, it is not useful in many data stream scenarios that require dynamic updates. Varying acceptable FP rates. Another experiment we ran was to test the effect of changing the acceptable FP rates. The results are shown in Figure 5. In this experiments, we set Max=3, K=4 when acceptable FP rates are set to 0.5% and %, and set Max=, K=2 when acceptable FP rates are set to 0% and 20%. The bar chart depicts the FN rate difference between FPbuffering and SBF. Again, the FP rates of both methods are set to the same number. Clearly, it shows that the more FPs are allow, the better SBF performs. Figure 4: Error rates comparison between SBF, FP- Buffering, Buffering and BF Comparison between different methods. The tables in Figure 4 show that when the space is relatively small, SBF is better. SBF beats FPbuffering by 3-3% in terms of FN rate on different data sets, when their FP rates are the same. For the problem we are studying, we think this amount of improvement is nontrivial for 2 reasons. First, Broder et al. [8] implemented a theoretically optimal buffering algorithm called MIN, which assumes the entire sequence of requests is known in advance, and accordingly chooses the best replacement strategy. Even this obviously impractical and ideal algorithm can only reduce the miss rates (FN rates in our case) of the LRU buffering, by no more than 5% in about 2/3 region (different buffer sizes). Second, from the tables we can see that even increasing the amount of the space by a factor of 4, the FN rates for buffering can Figure 5: FN rate differences between FPBuffering and SBF varying allowable FP rate(695m elements) 5.5 Time Comparison

11 As discussed in the implementation section, SBF and BF need O() time to process each element. The exact time depends on the the parameter settings. For example, when K=2 and P=4, SBF needs less than 0 operations within each iteration. For buffering and FPbuffering, their processing time is the same. It depends on 2 processes: element searching and element evicting. Searching can be quite expensive without an index structure. Both our experiments and those of Broder et al.[8] used a hash table to accelerate the search process. The extra space that is needed for a hash table to keep the search time constant is linear in the number of elements stored. The process of maintaining the LRU replacement policy(finding the least recently used element) is also costly, and extra space is needed to make it faster. This extra space can be quite large for LRU. However, this cost can be reduced to 2 bits per elements by using the Clock approximation of LRU [8]. Therefore, buffering and FPbuffering need extra space linear in the number of buffer entries to reach a similar O() processing time. But in our error rate comparison, we did not count this extra space for buffering and FPbuffering. 5.6 Methods Comparison Summary We compared 4 methods in this section: SBF, BF, FPbuffering and buffering. Among them, BF and buffering have only FPs and FNs respectively, and SBF and FPbuffering have errors of both sides. BF is a space efficient data structure which has been studied in the past and is widely used. It is good for representing a static set of data provided that the number of distinct elements is known. However, in data stream environments, the data is not static and it keeps changing. Usually it is hard to know the number of distinct elements in advance. Moreover, BF is not applicable in cases where dynamic updates are needed since elements can only be inserted into BF, but cannot be dropped out. Consequently, BF is not suitable for many data stream applications. Another variation of BF, counting BF, allows deletions, but it is still not applicable in the scenario we consider. See the discussion in the first paragraph of the related work section. SBF, buffering and FPBuffering can be all applied to data stream scenarios. SBF is better in terms of accuracy and time when certain amount of FP rates are acceptable and the space is relatively small, which is the case in many data stream applications due to the real-time constraint. When the space is relatively large or only tiny FP rates are allowed, buffering is better. 6. RELATED WORK The recent work of Metwally et al.[24]also study the duplicate detection problem in a streaming environment based on Bloom filters(bf) [7]. They consider different window models: Landmark windows, sliding windows and jumping windows. For the landmark window model, which is the scenario we consider, they apply the original Bloom filters without variations to detect duplicates, and thus do not consider the case that the BFs become full. For the sliding window model, they use counting BFs [8] (change bits into counters) to allow removing old information out of the Bloom filter. However, this can be done only when the element to be removed is known, which is not possible in many streaming cases. For example, if the oldest element needs to be removed, one has to know that which counters are touched by the oldest element, but this information cannot be found in counting BFs, and maintaining this knowledge can be quite expensive. For the jumping window model, they cut a large jumping window into multiple sub-windows, and represent both the jumping window and the sub-windows with counting BFs of the same size. Thus, the jumping window can jump forward by adding and removing sub-window BFs. Another solution for detecting duplicates in a streaming environment is the buffering or the caching method, which has been studied in many areas such as database systems, computer architecture, operating systems, and more recently URL caching in Web crawling [8]. We compare our method with those of Broder et al.[8] in the experiments. The problem of exact duplicate elimination is well studied, and there are many efficient algorithms(e.g. see [20] for details and references). For the problem of approximate membership testing in a non-streaming environment, the Bloom filter has been frequently used and occasionally extended [8, 25]. Cohen and Matias[4] extend the Bloom filter to answer multiplicity queries. Counting distinct elements using the Bloom filter is proposed by Whang et al.[3]. Another branch of duplicate detection focus on fuzzy duplicates [,, 6, 30], where the distinction between elements is not straightforward to see. A related problem to duplicate detection is counting the number of distinct elements. Flajolet and Martin[9] propose a bitmap sketch to address this problem in a large data set. The same problem is also studied by Cormode et al. [5], and a sketch based on stable random variables is introduced. Besides, the sticky sampling algorithm of Manku and Motwani [23] also randomly increment and decrement counters storing the frequencies of stream elements, but the decrement frequency is varying and not for each incoming element. Their goal is to find the frequent items in a data stream. As for data stream systems [3, 9, 29, 0, 6, 2], as far as we know, most of them divide the potentially unbounded data stream into windows with limited size and solve the problem precisely within the window. For example, Tucker et al. introduce punctuations into data streams, and thus duplicate eliminations could be implemented within data stream windows using traditional methods[29]. Since there is no way to store the entire history of an infinite data stream using limited space, our SBF essentially represents the most recent information by discarding those stale continuously. This is useful in many scenarios where the recent data is more important and this importance decays over time. A number of such kinds of applications are provided in [3] and [26]. Our motivating example of web crawling also has this property, since it may not matter that much to redundantly fetch a Web page that have been crawled a long time ago compared to fetching a page that have been crawled more recently.

12 7. CONCLUSIONS In this paper, we propose the SBF method to approximately detect duplicates for streaming data. When a certain false positive rate is allowed, SBF is superior in terms of both accuracy and time for a fixed amount of space compared to the alternative methods. Extending our work to handle sliding window queries is a future direction. Acknowledgement This work is supported by Natural Sciences and Engineering Research Council of Canada. We like to thank the anonymous reviewers for their comments and Ahmed Metwally for his feedback. 8. REFERENCES [] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. of VLDB, [2] Internet Archive. [3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of PODS, [4] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In Proc. of ICDE, [5] F. Baboescu, S. Singh, and G. Varghese. Packet classification for core routers: Is there an alternative to cams? In Proc. of INFOCOMM, [6] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity m easures. In Proc. of KDD, [7] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. In CACM, 970. [8] A. Z. Broder, M. Najork, and J. L. Wiener. Efficient url caching for world wide web crawling. In Proc. of WWW, [9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. B.Zdonik. Monitoring streams - a new class of data management applications. In Proc. of VLDB, [0] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. A. Shah. Telegraphcq: Continuous dataflow processing for an uncertain world. In Proc. of CIDR, [] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proc. of ICDE, [2] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. Niagaracq: A scalable continuous query system for internet databases. In Proc. of SIGMOD, [3] E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of PODS, [4] S. Cohen and Y. Matias. Spectral bloom filters. In Proc. of SIGMOD, [5] G. Cormode, M. Datar, P. Indyk, and S.Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 5(3): , [6] C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: A stream database for network applications. In Proc. of SIGMOD, [7] C. Estan and G. Varghese. Data streaming in computer networks. In Proc. of Workshop on Management and Processing of Data Streams(MPDS) in cooperation with SIGMOD/PODS, [8] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache:a scalable wide area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3):28 293, [9] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 3(2):82 209, 985. [20] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database System Implementation. Prentice Hall, [2] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4), 999. [22] Cisco System Inc. Cisco network accounting services. prodlit/nwact_wp.pdf, [23] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of VLDB, [24] A. Metwally, D. Agrawal, and A. E. Abbadi. Duplicate detection in click streams. In Proc. of WWW, [25] M. Mitzenmacher. Compressed bloom filters. IEEE/ACM Trans. Netw., 0(5):604 62, [26] T. Palpanas, M. Vlachos, E. J. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In Proc. of ICDE, [27] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-5-8, Center for Research in Computing Technology, Harvard University, 98. [28] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proc. of VLDB, [29] P. A. Tucker, D. Maier, and T. Sheard. Applying punctuation schemes to queries over continuous data streams. IEEE Data Eng. Bull., 26():33 40, [30] M. Weis and F. Naumann. Dogmatix tracks down duplicates in xml. In Proc. of SIGMOD, June [3] K. Whang, B. T. V. Zenden, and H. M.Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 5(2): , 990.

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set

More information

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

More information

Detecting Click Fraud in Pay-Per-Click Streams of Online Advertising Networks

Detecting Click Fraud in Pay-Per-Click Streams of Online Advertising Networks The 28th International Conference on Distributed Computing Systems Detecting Click Fraud in Pay-Per-Click Streams of Online Advertising Networks Linfeng Zhang and Yong Guan Department of Electrical and

More information

Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu.

Procedia Computer Science 00 (2012) 1 21. Trieu Minh Nhut Le, Jinli Cao, and Zhen He. trieule@sgu.edu.vn, j.cao@latrobe.edu.au, z.he@latrobe.edu. Procedia Computer Science 00 (2012) 1 21 Procedia Computer Science Top-k best probability queries and semantics ranking properties on probabilistic databases Trieu Minh Nhut Le, Jinli Cao, and Zhen He

More information

Identifying Frequent Items in Sliding Windows over On-Line Packet Streams

Identifying Frequent Items in Sliding Windows over On-Line Packet Streams Identifying Frequent Items in Sliding Windows over On-Line Packet Streams (Extended Abstract) Lukasz Golab lgolab@uwaterloo.ca David DeHaan dedehaan@uwaterloo.ca Erik D. Demaine Lab. for Comp. Sci. M.I.T.

More information

MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

More information

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman To motivate the Bloom-filter idea, consider a web crawler. It keeps, centrally, a list of all the URL s it has found so far. It

More information

Duplicate Detection in Click Streams

Duplicate Detection in Click Streams Duplicate Detection in Click Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science, University of California at Santa Barbara Santa Barbara CA 93106 {metwally, agrawal,

More information

Monitoring Large Flows in Network

Monitoring Large Flows in Network Monitoring Large Flows in Network Jing Li, Chengchen Hu, Bin Liu Department of Computer Science and Technology, Tsinghua University Beijing, P. R. China, 100084 { l-j02, hucc03 }@mails.tsinghua.edu.cn,

More information

Counting Problems in Flash Storage Design

Counting Problems in Flash Storage Design Flash Talk Counting Problems in Flash Storage Design Bongki Moon Department of Computer Science University of Arizona Tucson, AZ 85721, U.S.A. bkmoon@cs.arizona.edu NVRAMOS 09, Jeju, Korea, October 2009-1-

More information

Enhancing Collaborative Spam Detection with Bloom Filters

Enhancing Collaborative Spam Detection with Bloom Filters Enhancing Collaborative Spam Detection with Bloom Filters Jeff Yan Pook Leong Cho Newcastle University, School of Computing Science, UK {jeff.yan, p.l.cho}@ncl.ac.uk Abstract Signature-based collaborative

More information

Load Balancing and Switch Scheduling

Load Balancing and Switch Scheduling EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

More information

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

More information

arxiv:1112.0829v1 [math.pr] 5 Dec 2011

arxiv:1112.0829v1 [math.pr] 5 Dec 2011 How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

More information

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Creating Synthetic Temporal Document Collections for Web Archive Benchmarking Kjetil Nørvåg and Albert Overskeid Nybø Norwegian University of Science and Technology 7491 Trondheim, Norway Abstract. In

More information

Estimating Frequency of Change

Estimating Frequency of Change Estimating Frequency of Change Junghoo Cho University of California, LA and Hector Garcia-Molina Stanford University Many online data sources are updated autonomously and independently. In this paper,

More information

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m. Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

More information

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

Scalable Bloom Filters

Scalable Bloom Filters Scalable Bloom Filters Paulo Sérgio Almeida Carlos Baquero Nuno Preguiça CCTC/Departamento de Informática Universidade do Minho CITI/Departamento de Informática FCT, Universidade Nova de Lisboa David Hutchison

More information

P (A) = lim P (A) = N(A)/N,

P (A) = lim P (A) = N(A)/N, 1.1 Probability, Relative Frequency and Classical Definition. Probability is the study of random or non-deterministic experiments. Suppose an experiment can be repeated any number of times, so that we

More information

TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) Internet Protocol (IP)

TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) Internet Protocol (IP) TCP over Multi-hop Wireless Networks * Overview of Transmission Control Protocol / Internet Protocol (TCP/IP) *Slides adapted from a talk given by Nitin Vaidya. Wireless Computing and Network Systems Page

More information

An Efficient Distributed Algorithm to Identify and Traceback DDoS Traffic

An Efficient Distributed Algorithm to Identify and Traceback DDoS Traffic Ó The Author 26. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org doi:1.193/comjnl/bxl26

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

Effective Page Refresh Policies For Web Crawlers

Effective Page Refresh Policies For Web Crawlers Effective Page Refresh Policies For Web Crawlers JUNGHOO CHO University of California, Los Angeles and HECTOR GARCIA-MOLINA Stanford University In this paper we study how we can maintain local copies of

More information

The Trip Scheduling Problem

The Trip Scheduling Problem The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems

More information

Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

More information

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29. Broadband Networks Prof. Dr. Abhay Karandikar Electrical Engineering Department Indian Institute of Technology, Bombay Lecture - 29 Voice over IP So, today we will discuss about voice over IP and internet

More information

x if x 0, x if x < 0.

x if x 0, x if x < 0. Chapter 3 Sequences In this chapter, we discuss sequences. We say what it means for a sequence to converge, and define the limit of a convergent sequence. We begin with some preliminary results about the

More information

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

More information

Factoring & Primality

Factoring & Primality Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

More information

OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

More information

Notes V General Equilibrium: Positive Theory. 1 Walrasian Equilibrium and Excess Demand

Notes V General Equilibrium: Positive Theory. 1 Walrasian Equilibrium and Excess Demand Notes V General Equilibrium: Positive Theory In this lecture we go on considering a general equilibrium model of a private ownership economy. In contrast to the Notes IV, we focus on positive issues such

More information

Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

More information

By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

By choosing to view this document, you agree to all provisions of the copyright laws protecting it. This material is posted here with permission of the IEEE Such permission of the IEEE does not in any way imply IEEE endorsement of any of Helsinki University of Technology's products or services Internal

More information

Job Reference Guide. SLAMD Distributed Load Generation Engine. Version 1.8.2

Job Reference Guide. SLAMD Distributed Load Generation Engine. Version 1.8.2 Job Reference Guide SLAMD Distributed Load Generation Engine Version 1.8.2 June 2004 Contents 1. Introduction...3 2. The Utility Jobs...4 3. The LDAP Search Jobs...11 4. The LDAP Authentication Jobs...22

More information

Big Data & Scripting storage networks and distributed file systems

Big Data & Scripting storage networks and distributed file systems Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES ULFAR ERLINGSSON, MARK MANASSE, FRANK MCSHERRY MICROSOFT RESEARCH SILICON VALLEY MOUNTAIN VIEW, CALIFORNIA, USA ABSTRACT Recent advances in the

More information

Security in Outsourcing of Association Rule Mining

Security in Outsourcing of Association Rule Mining Security in Outsourcing of Association Rule Mining W. K. Wong The University of Hong Kong wkwong2@cs.hku.hk David W. Cheung The University of Hong Kong dcheung@cs.hku.hk Ben Kao The University of Hong

More information

On demand synchronization and load distribution for database grid-based Web applications

On demand synchronization and load distribution for database grid-based Web applications Data & Knowledge Engineering 51 (24) 295 323 www.elsevier.com/locate/datak On demand synchronization and load distribution for database grid-based Web applications Wen-Syan Li *,1, Kemal Altintas, Murat

More information

MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Deployment of express checkout lines at supermarkets

Deployment of express checkout lines at supermarkets Deployment of express checkout lines at supermarkets Maarten Schimmel Research paper Business Analytics April, 213 Supervisor: René Bekker Faculty of Sciences VU University Amsterdam De Boelelaan 181 181

More information

Competitive Analysis of On line Randomized Call Control in Cellular Networks

Competitive Analysis of On line Randomized Call Control in Cellular Networks Competitive Analysis of On line Randomized Call Control in Cellular Networks Ioannis Caragiannis Christos Kaklamanis Evi Papaioannou Abstract In this paper we address an important communication issue arising

More information

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

More information

2004 Networks UK Publishers. Reprinted with permission.

2004 Networks UK Publishers. Reprinted with permission. Riikka Susitaival and Samuli Aalto. Adaptive load balancing with OSPF. In Proceedings of the Second International Working Conference on Performance Modelling and Evaluation of Heterogeneous Networks (HET

More information

The Relation between Two Present Value Formulae

The Relation between Two Present Value Formulae James Ciecka, Gary Skoog, and Gerald Martin. 009. The Relation between Two Present Value Formulae. Journal of Legal Economics 15(): pp. 61-74. The Relation between Two Present Value Formulae James E. Ciecka,

More information

Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

More information

14.1 Rent-or-buy problem

14.1 Rent-or-buy problem CS787: Advanced Algorithms Lecture 14: Online algorithms We now shift focus to a different kind of algorithmic problem where we need to perform some optimization without knowing the input in advance. Algorithms

More information

Data Streams A Tutorial

Data Streams A Tutorial Data Streams A Tutorial Nicole Schweikardt Goethe-Universität Frankfurt am Main DEIS 10: GI-Dagstuhl Seminar on Data Exchange, Integration, and Streams Schloss Dagstuhl, November 8, 2010 Data Streams Situation:

More information

Signature Verification Why xyzmo offers the leading solution.

Signature Verification Why xyzmo offers the leading solution. Dynamic (Biometric) Signature Verification The signature is the last remnant of the hand-written document in a digital world, and is considered an acceptable and trustworthy means of authenticating all

More information

Software-Defined Traffic Measurement with OpenSketch

Software-Defined Traffic Measurement with OpenSketch Software-Defined Traffic Measurement with OpenSketch Lavanya Jose Stanford University Joint work with Minlan Yu and Rui Miao at USC 1 1 Management is Control + Measurement control - Access Control - Routing

More information

Efficient Large Flow Detection over Arbitrary Windows: An Algorithm Exact Outside An Ambiguity Region

Efficient Large Flow Detection over Arbitrary Windows: An Algorithm Exact Outside An Ambiguity Region Efficient Large Flow Detection over Arbitrary Windows: An Algorithm Exact Outside An Ambiguity Region Hao Wu University of Illinois at Urbana-Champaign Hsu-Chun Hsiao Carnegie Mellon University National

More information

ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE

ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE ECON20310 LECTURE SYNOPSIS REAL BUSINESS CYCLE YUAN TIAN This synopsis is designed merely for keep a record of the materials covered in lectures. Please refer to your own lecture notes for all proofs.

More information

Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project

Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project Effects of Filler Traffic In IP Networks Adam Feldman April 5, 2001 Master s Project Abstract On the Internet, there is a well-documented requirement that much more bandwidth be available than is used

More information

Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

More information

Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range

Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range THEORY OF COMPUTING, Volume 1 (2005), pp. 37 46 http://theoryofcomputing.org Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range Andris Ambainis

More information

Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation

Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation Establishing How Many VoIP Calls a Wireless LAN Can Support Without Performance Degradation ABSTRACT Ángel Cuevas Rumín Universidad Carlos III de Madrid Department of Telematic Engineering Ph.D Student

More information

DETERMINANTS. b 2. x 2

DETERMINANTS. b 2. x 2 DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in

More information

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom. Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

More information

From Spontaneous Total Order to Uniform Total Order: different degrees of optimistic delivery

From Spontaneous Total Order to Uniform Total Order: different degrees of optimistic delivery From Spontaneous Total Order to Uniform Total Order: different degrees of optimistic delivery Luís Rodrigues Universidade de Lisboa ler@di.fc.ul.pt José Mocito Universidade de Lisboa jmocito@lasige.di.fc.ul.pt

More information

1 if 1 x 0 1 if 0 x 1

1 if 1 x 0 1 if 0 x 1 Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

More information

2 Primality and Compositeness Tests

2 Primality and Compositeness Tests Int. J. Contemp. Math. Sciences, Vol. 3, 2008, no. 33, 1635-1642 On Factoring R. A. Mollin Department of Mathematics and Statistics University of Calgary, Calgary, Alberta, Canada, T2N 1N4 http://www.math.ucalgary.ca/

More information

Using simulation to calculate the NPV of a project

Using simulation to calculate the NPV of a project Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial

More information

Time-Series Prediction with Applications to Traffic and Moving Objects Databases

Time-Series Prediction with Applications to Traffic and Moving Objects Databases Time-Series Prediction with Applications to Traffic and Moving Objects Databases Bo Xu Department of Computer Science University of Illinois at Chicago Chicago, IL 60607, USA boxu@cs.uic.edu Ouri Wolfson

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

18-731 Midterm. Name: Andrew user id:

18-731 Midterm. Name: Andrew user id: 18-731 Midterm 6 March 2008 Name: Andrew user id: Scores: Problem 0 (10 points): Problem 1 (10 points): Problem 2 (15 points): Problem 3 (10 points): Problem 4 (20 points): Problem 5 (10 points): Problem

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

Load Shedding for Aggregation Queries over Data Streams

Load Shedding for Aggregation Queries over Data Streams Load Shedding for Aggregation Queries over Data Streams Brian Babcock Mayur Datar Rajeev Motwani Department of Computer Science Stanford University, Stanford, CA 94305 {babcock, datar, rajeev}@cs.stanford.edu

More information

On the effect of forwarding table size on SDN network utilization

On the effect of forwarding table size on SDN network utilization IBM Haifa Research Lab On the effect of forwarding table size on SDN network utilization Rami Cohen IBM Haifa Research Lab Liane Lewin Eytan Yahoo Research, Haifa Seffi Naor CS Technion, Israel Danny Raz

More information

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

More information

Lectures 5-6: Taylor Series

Lectures 5-6: Taylor Series Math 1d Instructor: Padraic Bartlett Lectures 5-: Taylor Series Weeks 5- Caltech 213 1 Taylor Polynomials and Series As we saw in week 4, power series are remarkably nice objects to work with. In particular,

More information

Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

More information

Offline sorting buffers on Line

Offline sorting buffers on Line Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

On the Interaction and Competition among Internet Service Providers

On the Interaction and Competition among Internet Service Providers On the Interaction and Competition among Internet Service Providers Sam C.M. Lee John C.S. Lui + Abstract The current Internet architecture comprises of different privately owned Internet service providers

More information

TCP in Wireless Mobile Networks

TCP in Wireless Mobile Networks TCP in Wireless Mobile Networks 1 Outline Introduction to transport layer Introduction to TCP (Internet) congestion control Congestion control in wireless networks 2 Transport Layer v.s. Network Layer

More information

CLIC: CLient-Informed Caching for Storage Servers

CLIC: CLient-Informed Caching for Storage Servers CLIC: CLient-Informed Caching for Storage Servers Xin Liu University of Waterloo Ashraf Aboulnaga University of Waterloo Xuhui Li University of Waterloo Kenneth Salem University of Waterloo Abstract Traditional

More information

9.2 Summation Notation

9.2 Summation Notation 9. Summation Notation 66 9. Summation Notation In the previous section, we introduced sequences and now we shall present notation and theorems concerning the sum of terms of a sequence. We begin with a

More information

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS

TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS TCOM 370 NOTES 99-4 BANDWIDTH, FREQUENCY RESPONSE, AND CAPACITY OF COMMUNICATION LINKS 1. Bandwidth: The bandwidth of a communication link, or in general any system, was loosely defined as the width of

More information

NP-Completeness and Cook s Theorem

NP-Completeness and Cook s Theorem NP-Completeness and Cook s Theorem Lecture notes for COM3412 Logic and Computation 15th January 2002 1 NP decision problems The decision problem D L for a formal language L Σ is the computational task:

More information

Optimal Tracking of Distributed Heavy Hitters and Quantiles

Optimal Tracking of Distributed Heavy Hitters and Quantiles Optimal Tracking of Distributed Heavy Hitters and Quantiles Ke Yi HKUST yike@cse.ust.hk Qin Zhang MADALGO Aarhus University qinzhang@cs.au.dk Abstract We consider the the problem of tracking heavy hitters

More information

FACTORING POLYNOMIALS IN THE RING OF FORMAL POWER SERIES OVER Z

FACTORING POLYNOMIALS IN THE RING OF FORMAL POWER SERIES OVER Z FACTORING POLYNOMIALS IN THE RING OF FORMAL POWER SERIES OVER Z DANIEL BIRMAJER, JUAN B GIL, AND MICHAEL WEINER Abstract We consider polynomials with integer coefficients and discuss their factorization

More information

Alok Gupta. Dmitry Zhdanov

Alok Gupta. Dmitry Zhdanov RESEARCH ARTICLE GROWTH AND SUSTAINABILITY OF MANAGED SECURITY SERVICES NETWORKS: AN ECONOMIC PERSPECTIVE Alok Gupta Department of Information and Decision Sciences, Carlson School of Management, University

More information

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis January 8, 2008 FloCon 2008 Chris Roblee, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department

More information

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

More information

Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods

Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods Enrique Navarrete 1 Abstract: This paper surveys the main difficulties involved with the quantitative measurement

More information

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model

Mining Data Streams. Chapter 4. 4.1 The Stream Data Model Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make

More information

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows TECHNISCHE UNIVERSITEIT EINDHOVEN Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows Lloyd A. Fasting May 2014 Supervisors: dr. M. Firat dr.ir. M.A.A. Boon J. van Twist MSc. Contents

More information

Differential privacy in health care analytics and medical research An interactive tutorial

Differential privacy in health care analytics and medical research An interactive tutorial Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012 Overview 1. Releasing medical data: What could

More information

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID

SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of

More information

An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

More information

Multi-service Load Balancing in a Heterogeneous Network with Vertical Handover

Multi-service Load Balancing in a Heterogeneous Network with Vertical Handover 1 Multi-service Load Balancing in a Heterogeneous Network with Vertical Handover Jie Xu, Member, IEEE, Yuming Jiang, Member, IEEE, and Andrew Perkis, Member, IEEE Abstract In this paper we investigate

More information

How I won the Chess Ratings: Elo vs the rest of the world Competition

How I won the Chess Ratings: Elo vs the rest of the world Competition How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

Low-rate TCP-targeted Denial of Service Attack Defense

Low-rate TCP-targeted Denial of Service Attack Defense Low-rate TCP-targeted Denial of Service Attack Defense Johnny Tsao Petros Efstathopoulos University of California, Los Angeles, Computer Science Department Los Angeles, CA E-mail: {johnny5t, pefstath}@cs.ucla.edu

More information