# Approximate Counts and Quantiles over Sliding Windows

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University Gurmeet Singh Manku Stanford University ABSTRACT We consider the problem of maintaining -approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both the ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O( 1 polylog( 1, )) space. For quantiles, this space requirement is an improvement over the previous best bound of O( 1 polylog( 1, )). We believe that no previous work on space-efficient approximate counts over sliding windows exists. 1. ITRODUCTIO Data streams arise in several application domains like highspeed networking, finance, and transaction logs. For many applications, recent elements of a stream are more important than those that arrived a long time ago. This preference for recent elements is commonly expressed using a sliding window [], which identifies a portion of the stream that arrived between now and some recent time in the past. Computation of various statistics over the current contents of the window is an important operation [3, 9, 1, 17]. For many statistics, computing an exact answer requires the storage of the entire current window, which is infeasible for many practical stream rates and window sizes. Also, exact answers are not crucial for many applications; approximate answers suffice. This paper considers the problem of approximate, spaceefficient computation of two important statistics over sliding Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS 00 June 1-16, 00, Paris, France. Copyright 00 ACM X/0/06... \$5.00. windows frequency counts and quantiles. We believe that frequency-counts over sliding windows have not been studied before. Approximate quantiles over sliding windows have been studied by in et al [17]. Our algorithms beat the algorithms by in et al in terms of space complexity.. DEFIITIOS Frequency Counts An -approximate frequency count (or simply count) for a bag 1 B of elements is a set of e, f e pairs (e B) such that: (1) the approximate count f e is smaller than the true frequency f e of e in B, but by at most, i.e., (f e ) f e f e; () any element e B with a frequency f e appears in the set; other elements may or may not. Clearly, there could be more than one set that satisfies the above two properties. Approximate counts are important because they can be used to solve approximate frequent elements problem, which has applications in data mining and network monitoring [18]. The frequent elements problem over a bag B seeks the set of elements of B whose frequency exceeds s for a given threshold parameter s. The -approximate frequent elements problem allows some false positives elements whose frequency is greater than (s ) can appear in the answer. However, no false negatives are allowed; all elements whose frequency is greater than s must appear in the answer. It can be verified that an -approximate count for B can be used to find -approximate frequent elements of B if s. Quantiles Consider a bag B of elements. The rank of an element is its position in a sorted arrangement of elements of B. The rank of the minimum element is 1, while that of the maximum element is. The φ-quantile (φ (0, 1]) of B is the element with rank φ. The 0.5-quantile is called the median. An element is said to be an -approximate φ- quantile if its rank is between (φ ) and (φ + ) ; clearly there could be many elements that qualify. Streams and Sliding Windows A data stream is a continuously arriving sequence of elements drawn from some ordered domain. At any point in time, a sliding window over a stream is a bag of last elements 1 A bag is simply a multi-set, allowing multiple occurrences of an element.

3 bounds are O( log( )) for fixed-size windows and O( 1 log ()) for variable-size windows. In this paper, we improve upon both bounds, our space requirements being O( 1 log 1 log ) and O( 1 log 1 log log ) respectively. Randomized Quantile-Finding Algorithms For identifying exact quantiles in main memory, a simple linear time randomized algorithm was presented by Floyd and Rivest [1]. For approximate quantiles, randomization reduces the space requirements significantly. The key insight is the well-known fact that the φ-quantile of a random sample of size O( 1 log (δ) 1 ) is an -approximate φ-quantile of elements with probability at least 1 δ. This observation has been exploited, for example, by DeWitt et al [11] to identify splitters of large datasets in the context of distributed sorting. For data streams, a randomized quantilefinding algorithm was proposed by Manku et al [0] that requires only O( 1 log ( 1 log(δ) 1 )) space, beating the sampling bound. Further improvement in space is possible, as described in Section 6. Recently Cormode and Muthukrishnan [8] proposed a sketching technique called CM-sketches that can be used to maintain -approximate quantiles over streams and updateable relations. Their approach requires an advance knowledge of the domain U from which the elements of the stream are drawn. The space requirement for their approach is O( 1 log U log( U /(δ))), where U denotes the size of the domain U. Related Problems A problem related to approximate counting is the top-k problem, also known as the Hot Item problem, where the goal is to identify k items which are most frequent. Algorithms for the problem over data streams have been developed by Charikar et al [6], Cormode and Muthukrishnan [7] and Gibbons and Matias [13]. Sliding window algorithms have been developed for a variety of problems: bit-counting (Datar et al [9]), sampling (Babcock et al [3]), variance and k-medians (Datar et al [], distinct values and bit-counts (Gibbons and Tirthapura [1]) and quantiles (in et al [17]).. SIDIG-WIDOW SKETCHES Our presentation focuses on the data structures maintained by our algorithms, which we call sliding-window sketches. In this section, we define two general classes of sliding window sketches called unbounded-window sketches and bounded-window sketches. We then describe how these sketches can be used to derive algorithms for maintaining approximate statistics over fixed- and variable-size windows. Recall from Section that a sliding window can be defined as a sequence of two operations: insertion of a new stream element into the window when it arrives, and deletion of the oldest element in the window. Intuitively, a sliding-window sketch is maintained over a stream and allows some approximate statistic to be computed over a sliding window of the stream. Specifically, a sliding-window sketch supports three operations: (1) query, which computes approximate statistics over the current contents of the window; () insert, which updates the sketch when a new element is inserted into the window; and (3) delete, which updates the sketch when the oldest element is deleted from the window. For the definitions that follow, assume that the sketches are maintained over the stream of elements e 1, e, e 3,.... Definition 1. An unbounded-window sketch S is defined using a fixed error parameter, and is denoted S. S allows -approximate statistics to be computed over a variable-size window of the stream. The contents of S, when the current size of the stream is m and the current size of the window is, is denoted S (m, ). S (m, ) supports three operations: 1. query: Returns -approximate value of the maintained statistic over the bag of elements {e m +1, e m +,..., e m}.. insert: Constructs S (m + 1, + 1) from S (m, ) and the inserted element e m delete: Constructs S (m, 1) from S (m, ). Definition. A bounded-window sketch S is defined using two parameters an error parameter and a max-window parameter W and is denoted S W,. S W, allows approximate statistics to be computed over a variable-size window of the stream, but the size of the window is constrained to be W. The contents of S W,, when the current size of the stream is m and the current size of the window is, is denoted S W,(m, ). S W,(m, ) supports three operations: 1. query: Returns (W/)-approximate value of the maintained statistic over the bag of elements {e m +1, e m +,..., e m}.. insert: Constructs S W,(m+1, +1) from S W,(m, ) and the inserted element e m+1, if + 1 W. 3. delete: Constructs S W,(m, 1) from S W,(m, ). Both bounded- or unbounded-window sketches allows approximate statistics to be computed over variable-size windows. For bounded-window sketches, the size of the window is constrained to be less than a fixed max-window parameter W, while there are no such constraints for unboundedwindow sketches. We can use bounded- and unbounded-window sketches to derive algorithms for maintaining statistics over sliding windows as follows: An -approximate algorithm over a fixedsize window of size uses a bounded-window sketch S, (i.e., the max-window parameter W = ). Insertions to the window are handled using the insert operation and deletions using the delete operation of the sketch. From the definition of fixed-size windows, it follows that after m

4 = log (/) Highest level l = (+) l Error at level l l = W l Size of level-l block Table 1: ist of symbols used by C W, and Q W,. elements of the stream have been seen, the sketch maintained by the algorithm is of the form S,(m, ). By definition, this sketch allows / = -approximate statistics to be computed over the previous elements. Similarly, an -approximate algorithm over variable-size windows uses an unbounded-window sketch S, and handles insertions and deletions from the window using the insert and delete operations, respectively. Therefore, when the current size of the stream is m and the current size of the window is, the sketch maintained by the algorithm is S (m, ). By definition, this sketch allows -approximate statistics to be computed over the current elements in the window. ote that the definition of bounded-window sketches is more general than required by our fixed-size window algorithms. For example, S W,(m, ) allows to vary between 0 and W, while the fixed-size window algorithms always maintain W =. The generality is in fact, exploited for constructing unbounded-window sketches from bounded-window sketches, as we will describe in Section 7. We can define randomized versions of these two types of sketches in a straightforward manner: A randomized boundedwindow sketch S W,,δ (m, ) is similar to a bounded-window sketch S W,(m, ) except that the approximation guarantee of the query operation holds with probability at least (1 δ). A randomized unbounded-window sketch S,δ (m, ) is defined similarly. 5. DETERMIISTIC, BOUDED-WIDOW SKETCHES In this section, we present two deterministic, bounded-window sketches for counts and quantiles called C W, and Q W,, respectively, where W denotes the maximum window width and the error parameter. We assume that both 1/ and W are powers of to avoid floors and ceilings in expressions. In order to design C W, for arbitrary W and, we identify W and such that: (a) W and 1/ are powers of ; (b) W W W ; and (c) W W. We then use the sketch C W, in the place of C W,. By construction, C W, is more accurate than CW,, and we can show that both have the same asymptotic space complexity. We also assume that W (1/). Otherwise, a simple construction of C W, and Q W, is to store all the elements in the current window, and use these to compute exact counts and quantiles. The sketch does not allow -approximate statistics to be computed until m elements are seen. In order to be able to compute -approximate sketches at all times, the algorithm should use an unbounded-sketch for the first elements and switch to a bounded-sketch subsequently. et S denote the stream over which the sketches C W, and Q W, are maintained, and let e 1, e,... denote the sequence of elements of S. Therefore, C W,(m, ) (resp. Q W,(m, )) denotes the contents of the sketch C W, (resp. Q W,) when the current size of S is m and the current window size is. C W, and Q W, conceptually make +1 copies of the stream, where = log (/). We say that these copies are at different levels, which are numbered sequentially as 0, 1,...,. For each level, the copy of the stream for that level is divided into non-overlapping blocks. Within each level-l, all blocks have the same size. We denote the size of level-l blocks by l. We set 0 = W 3 and for l 1, l = l 1. It follows that l = l 0 = W l. ote that = W. Within a level, blocks are numbered 0, 1,,...; smaller numbered blocks contain older elements. For example, in level-0 block 0 contains the first W elements, block 1 the next W, and so on. In general, block b in level-l contains the bag of elements e i, i [b l W +1, (b+1)l W ]. Figure 1 schematically illustrates blocks and levels. At any given point in time, a block is assigned to one of four states, depending on m, the number of elements of S so far, and, the current window size. Consider block b of level-l, which contains the bag of elements e i, i [b l W + 1, (b + 1) l W ] = [l, r]. This block is active if all its elements belong to the current window (m < l < r m), expired if at least one of its elements is older than the last elements (l m ), under-construction if some of its elements belong to the current window, and all the remaining, yet to arrive (m < l m and r > m), and inactive if none of its elements has arrived (l > m). Each block (upto level-) goes through the sequence of states inactive, under-construction, active, and expired. C W,(m, ) and Q W,(m, ) are collections of block-level sketches, one sketch per block that is currently active or under-construction (i.e., when the current size of the stream is m and the current size of the window is ). A sketch for a level-l active block has an associated error parameter l = (+) ( l), and allows l -approximate statistics (counts/quantiles) to be computed over the elements belonging to that block. The error parameter decreases as the level increase, so sketches for higher level blocks are more accurate than sketches for lower level blocks. The sketches for the active blocks are merged by the query operation to produce approximate statistics (count/quantiles) over the current contents of the window. The details of the block-level sketches and the merge operation differ for C W,(m, ) and Q W,(m, ), and we present them separately in Sections 5.1 and 5. respectively. In order to construct a sketch for a block when it becomes active, C W, and Q W, run a traditional one-pass algorithm (Misra-Gries [1] for C W, and Greenwald-Khanna [15] for Q W,) over the elements of a block when it is in state underconstruction. When all the elements of a block arrive (i.e., the state of the block changes from under-construction 3 There is no sanctity to the number in W. It can be replaced with a different constant provided other constants in related expressions are changed suitably.

5 W (Time) evel evel 3 evel evel 1 evel 0 εw Expired Active Under construction m Figure 1: evels and Blocks used in C W,(m, ) and Q W,(m, ) to active), the one-pass algorithm is queried to produce a sketch for the block. Again, the details of running the one-pass algorithm and querying it differ for C W, and Q W,, and we present them separately in Sections 5.1 and 5.. ote that it is sufficient to start the one-pass algorithm for a block when the first element of the block arrives, i.e., when the block changes from the inactive to the underconstruction state. In order to implement this conceptual operation, C W, and Q W, maintain sketches corresponding to blocks under-construction. The sketch for a block is essentially the state maintained by the instance of the onepass algorithm for that block. 5.1 Details of C W, Recall that C W,(m, ) is a collection of sketches corresponding to blocks that are active or under-construction, when the current length of the stream is m and the current window size is. The sketch stored by C W,(m, ) for a level-l active block is just an l -approximate count over the elements belonging to the block. In other words, the sketch is a set of e, f e pairs, where e denotes an element belonging to the block and f e denotes an approximate frequency count for the element. Further the set satisfies the following two properties: (1) the approximate count f e is smaller than the true frequency f e of element e in the block, but by at most l l, i.e., (f e l l ) f e f e; () any element e belonging to the block with a frequency f e l l appears in the set. In order to construct a sketch for a level-l block when it becomes active, C W, runs an instance of Misra-Gries algorithm over the elements of the block with error parameter l, when the block is under-construction. When the block becomes active, C W, stores the output of the Misra-Gries algorithm as the sketch for the block. The output is simply the state used by the Misra-Gries algorithm, which is known to be O( 1 ) [1]. Thus the space requirements for both active and under-construction blocks at level l are O( 1 l ). emma 1. (Merge Operation for Counts) Using a collection of s sketches over disjoint bags of 1,,... s elements each, with error parameters 1,,... s, respectively, we can compute an -approximate count for the union of the bags with error parameter = s s s. Proof. et B 1,..., B s denote the s bags and A 1,..., A s denote the s sketches over these bags. By definition, each A i is an i-approximate count over the elements of B i. We construct an output approximate count A for the union of the bags as follows: An element e occurs as part of A iff e occurs as part of some A i (1 i s), i.e., A i contains a pair of the form e, f ei. If so, A contains the pair e, f e. The approximate count f e is the sum of the approximate counts f ei of e stored in each A i. (If e does not occur in A i, fei is defined to be 0.) By definition of A i, f ei i i f ei f ei, where f ei is the true frequency of e in bag B i. It follows that f e ( s s) f e f e, where f e = f e f es is the true frequency of e in the union of the bags. If the (true) frequency f e of an element e in the union of the bags is at least s s, then the frequency f ei of e in at least one of the bags B i exceeds i i, implying that e occurs in A i. By construction, e occurs in A as well. Therefore A is an approximate count for the union of the bags with error parameter s s s. Theorem 1. C W,(m, ) allows W -approximate counts to be computed over the bag of elements {e m +1,..., e m}. Further, C W,(m, ) uses O( 1 log 1 ) space. Proof. Approximation: et B w = {e m +1,..., e m} denote the bag of elements in the current window. et α be the smallest integer such that α W m. et β be the largest integer such that β W m. et B ig (for bag of ignored elements) denote the bag of all elements e i such that m < i α W or β W < i m. Intuitively, B ig denotes the bag of elements that belong to the current window but do not belong to any active block. et B con (for considered elements) denote the bag of all elements e i such that α W + 1 i β W. Bcon denotes the bag of elements belonging to the current window that also belong to some active block. ote that B w = B ig B con. We claim without proof that B con can be expressed as a union of k non-overlapping active blocks such that k ( + ). (See Figure 1 for an illustration.) et B 1,..., B k denote the bag of elements belonging to these blocks. et A 1,..., A k denote the sketches stored by C W,(m, ) for these blocks. et A ig denote the trivial (empty set) sketch for the bag of elements B ig with error parameter ig = 1. We

6 use emma 1 to compute an approximate count A over the bag of elements B w = B 1... B k B ig using the sketches A 1,..., A k, A ig. We now prove that A is an W -approximate count for Bw. et i denote the approximation parameter for sketch A i. et i denote the number of elements in B i, and let ig denote the number of elements in B ig. We first note that for any level-l, l l =, which is independent of l. W (+) W (+) for Since each A i is some block-level sketch, i i = (1 i k). Using emma 1 the approximation parameter for A (say A) is given by: A = ( ig ig + = ( ig + k ( ig + W ( W + W W k i=1 i i)/ (1) W )/ () ( + ) )/ (3) )/ () Equation 3 follows from the fact that k ( + ), and Equation from the fact that ig W (which can be shown from the definition of B ig). Space requirement: The space required by C W,(m, ) is the space required to store the block-level sketches for blocks that are currently active or under-construction. As we argued earlier, a sketch for a level-l block that is either active or under-construction requires O( 1 ) space. l From the definition of bounded-window sketches (Definition ), we have W. Therefore there are at most active blocks at level-0, active blocks at W/ level-1, and so on. In general, we can show that there are at most l active blocks at level-l. Hence, the total space required to store the sketches corresponding to all the active blocks is O( l=0 l / l ). Substituting l = (+) ( l), the required space is O( l=0 ( + )/), which is O( l=0 ) = O( ) = O( 1 log 1 ). (5) By definition, there is at most one block that is underconstruction for each level. Therefore, the total space required for sketches corresponding to blocks under construction is O( l=0 1/ l). Since, l l, O( l=0 1/ l) is geometric and is O( 1 ) = O( 1 log 1 ). Therefore the total space required by C W,(m, ) is O( 1 log 1 ). 5. Details of Q W, When a level-l block is under under-construction, Q W, conceptually runs an instance of the Greenwald-Khanna algorithm [15] with error parameter l over the elements of the block. When the block becomes active, Q W, computes φ-quantiles for φ = l, l,..., 1 using the instance of the Greenwald-Khanna algorithm, and stores the output sequence as the sketch for the block. It is a known result [15] that this sequence can be used to compute l -approximate quantiles over the elements of this block. From the construction above, a sketch for a level-l active block is just a sequence of 1 l elements, which requires O( 1 l ) space. An instance of Greenwald-Khanna algorithm over elements of a level-l block (with error parameter l ) requires O( 1 log l l ) space, which is the space required for a sketch l corresponding to a level-l block under-construction. emma. (Merge Operation for Approximate Quantiles) Using a collection of s sketches over disjoint bags of 1,,... s elements each, with error parameters 1,,... s, respectively, we can compute -approximate quantiles for the union of the bags with error parameter = s s s. Proof. For simplicity, we assume that 1/ i (1 i s) is an integer. Also, assume that the sketches are constructed as described earlier, i.e., the i-sketch for the i th bag contains i -approximate φ-quantiles for φ = i, i,..., 1, though the lemma is valid for the more general class of sketches introduced in [15]. An approximate φ-quantile, φ (0, 1], over the union of the bags can be computed as follows: Associate a weight of i i with each element of the i-th sketch. Sort the elements of all sketches and pick the element with the property that the sum of weights of all preceding elements is < φ( s), but the sum including the element s weight is φ( s). We claim without proof that this element is an -approximate φ-quantile for the union of the bags, where = k k k. Theorem. Q W,(m, ) allows W -approximate quantiles to be computed over the elements {e m +1,..., e m}. Further, Q W,(m, ) uses O( 1 log 1 log W ) space. Proof. Approximation: The proof is identical to that of Theorem 1, with emma 1 replaced by emma. Space requirement: The space required by Q W,(m, ) is the space required to store the block-level sketches for blocks that are currently active or under-construction. As we argued earlier, a sketch for a level-l active block requires O( 1 l ) space, while a level-l block under-construction requires O( 1 l log l l ) space. The analysis of the space required for the block-level sketches corresponding to all the active blocks is identical to the analysis that we presented for C W,: There are at most l active blocks at level-l. Hence, the total space required to store the sketches corresponding to all the active blocks is O( l=0 l / l ). Substituting l = (+) ( l), the required space is O( l=0 (+)/), which is O( l=0 O( ) = O( 1 log 1 ). ) = There is at most one block that is under-construction for each level. Therefore the total space required for sketches corresponding to the blocks under-construction is O( l=0 1 l log l l ). For any level-l, l l = W, which (+) is independent of l. So, the above sum reduces to W 1 O(log (+) l=0 l ). Since, l l, l=0 O( 1 l ) is geometric and is O( 1 ) = O( 1 log 1 ). The total space required for sketches for blocks under-construction is therefore

7 O( 1 log 1 log ). The overall space is O( 1 log(1/) log log 1 log W ) which can be simplified to the more conservative expression O( 1 log 1 log W ). log(/) 6. RADOMIZED, BOUDED-WIDOW SKETCHES In this section, we present two randomized, bounded-window sketches RC W,,δ for counts and RQ W,,δ for quantiles. RQ W,,δ is a minor variation of Q W,, while RC W,,δ is significantly different from C W,. For the remainder of this section, assume a stream S with elements e 1, e, Counts RC W,,δ is based on a sampling technique called sticky sampling proposed in [18]. et r denote the integer satisfying 1/ r ( 1 log(δ) 1 )/W < 1/ r 1. RC W,,δ logically partitions the stream into blocks of size r. For each block, RC W,,δ samples one element that belongs to the block uniformly at random. When the current length of the stream is m and the current size of the window is, the contents of RC W,,δ (m, ) are as follows: For each element e i (m < i m), belonging to the last positions, that was chosen as a sample for its block, store the triple e i, f i, i. The count f i denotes the number of occurrences of e i until it is sampled again, or if e i is not sampled subsequently, the number of occurrences of e i until the current end of the stream. Formally, fi is defined as the number of elements e j = e i (i j m) such that there does not exist an element e k = e i (i < k j) which was chosen as a sample for its block. W Theorem 3. RC W,,δ (m, ) allows -approximate counts to be computed over the elements {e m +1,..., e m} with probability at least (1 δ). Further, RC W,,δ (m, ) uses O( 1 log(δ) 1 ) space. Proof. Approximation: An approximate count A over the bag of elements {e m +1,..., e m} is computed using RC W,,δ (m, ) in a natural way: For each distinct element e that occurs as part of a triple in RC W,,δ (m, ), A contains a pair e, f e. The approximate count f e for element e is the sum of all f i such that e, f i, i is a triple in RC W,,δ (m, ). Consider an element e whose true frequency f e in the current window exceeds W. Assume that e does not occur as part of A or, if it does occur, fe < f e W, where f e is the approximate count of e stored in A. We can show that this happens only if the earliest W occurrences of e in the current window are not sampled. et the earliest W occurrences of e in the current window be spread over k different blocks. et c 1,..., c k denote the cumulative frequencies of these W occurrences in the k blocks. Then the probability that the first W occurrences of e are not sampled is equal to Π k i=1(1 c i/ r ). Since 1 c i/ r < (1 1/ r ) c i, this probability is smaller than Π k i=1(1 1/ r ) c i = (1 1/ r ) W < e W/r < (δ). Since W (from the definition of bounded-window sketches), there are at most 1 elements whose frequency in the current window exceeds W. The W probability that some element e among these (at most) 1 f e < f e W is less than δ 1 = δ. elements has Therefore with a probability at least (1 δ), all the elements of e whose frequency f e in the current window exceeds W appear as a pair e, f e in A, such that f e f e W. Further, for any pair e, f e that appears in A, we can easily show that f e f e, where f e is the true frequency of the element e in the current window. It follows that A is an W -approximate count over the current window. Space Requirement: The number of triples stored in RC W,,δ (m, ) is less than r + 1, which requires O( W r ) = O( 1 log(δ) 1 ) space. The sampling technique used in RC W,,δ is slightly different from the original sticky sampling. In the original sticky sampling each new element is sampled independently with probability 1/ r. The space requirement for this sampling scheme is O( 1 log(δ) 1 ) in expectation. With 1-in- r sampling, as described above, the space requirements become worst-case. 6. Quantiles RQ W,,δ is identical to Q W, described in Section 5, except that it uses a randomized algorithm that we call RQAlg instead of the deterministic Greenwald-Khanna algorithm. We first describe RQAlg, before presenting details of RQ W,,δ. A randomized 1-pass algorithm for computing -approximate quantiles over a stream of length M works as follows: We maintain a sample of size O( 1 log(δ) 1 ) using Vitter s reser- voir sampling [7]. It is well-known that exact quantiles of the sample are -approximate quantiles of the stream, with probability at least 1 δ. An algorithm that is easier to code is to select 1 out of k successive elements with k = log (M/( 1 log δ 1 )). Quantiles computed over the resulting samples are indeed -approximate, as shown in [0]. The space requirements for both Vitter s scheme and the 1- in- k scheme, are O( 1 log(δ) 1 ). For small values of, the inverse-square dependence on leads to blowup in space requirements, making the algorithms impractical. Two approaches have been devised in recent years to combat this problem: (a) Gibbons and Matias [13] show how samples can be compressed if the stream abounds with duplicates, and (b) Manku et al [19] suggest a two-stage pipeline: the output of 1-out-of- k sampling can be fed to a deterministic quantile-finding algorithm, with the error parameter being / for the two stages. By employing the Greenwald-Khanna algorithm [15] in the second stage, we arrive at a randomized algorithm requiring only O( 1 log( 1 log(δ) 1 )) space. We call this algorithm RQAlg(, δ), parameterized by and δ. RQ W,,δ (m, ) stores block-level sketches for blocks that are currently active or under-construction, exactly like Q W,(m, ). When a level-l block is under-construction, If M is not known in advance, k is not fixed but grows logarithmically with M. However, we claim that it is possible to employ the adaptive sampling scheme Manku et al [0] in conjunction with a modified Greenwald-Khanna algorithm to arrive at the same bound.

8 δ (+) RQ W,,δ runs an instance of RQAlg( l, ) over the elements of the block. When the block becomes active, RQ W,,δ computes φ-quantiles for φ = l, l,..., 1 using the above instance of RQAg. W Theorem. RQ W,,δ (m, ) allows -approximate quantiles to be computed over the elements {e m +1,..., e m} with probability at least (1 δ). Further, RQ W,,δ (m, ) uses O( 1 log 1 log( 1 log 1 )) space, where = (δ)/ log(1/). Proof. (Sketch) Algorithm RQAlg guarantees that each block-level sketch has the desired error parameter with probability at least (1 δ ). In order to compute an approximate quantile, at most ( + ) block-level sketches are (+) merged. With probability at least (1 δ), each of these have the desired error parameter. Space requirements for active blocks are O( 1 log 1 ) (see Theorem for a proof). Space-requirements for blocks underconstruction are only O( 1 (log 1 ) log( 1 log 1 )), where = O(δ/ log(1/)). Summing the two, we get the overall space-complexity. 7. UBOUDED-WIDOW SKETCHES We present a general technique for constructing unboundedwindow sketches from bounded-window sketches that satisfy a certain property. All of our bounded-window sketches presented in Sections 5 and 6 satisfy this property, and therefore we can obtain a unbounded-window sketch corresponding to each one of our bounded-window sketches. 7.1 General Technique et F denote a bounded-window sketch that satisfies the following property: Property P: For any positive integers k and m, and real (0, 1], F k+1,(m, k ) can be constructed using F k,(m, k ). For any error parameter, an unbounded-window sketch V can be constructed from F as follows: V (m, ) is the collection of the following log bounded-window sketches, where k is the integer satisfying k 1 < k : {F k, (m, ), F k 1, (m, k 1 ),..., F, (m, )} The three basic operations query, insert, and delete of V (m, ) are performed as follows: query: To compute -approximate statistics for the current window, we use the query operation of the sketch F k, (m, ). By definition, the query operation of (/)k (m, ) produces -approximate statistics. Since F k, > k /, the approximation (/)k <, as required. insert: We compute V (m + 1, + 1) from V (m, ) when an element e m+1 is inserted as follows: We insert the element e m+1 into the sketch F k, (m, ) using its insert operation to produce F k, (m + 1, + 1). For all the remaining sketches {F k 1, (m, k 1 ),..., F, (m, )}, we first perform a delete operation and then insert the element e m+1 using their insert operations. This sequence of operations results in the following collection of sketches: {F k, (m+1, +1), F k 1, (m+1, k 1 ),..., F, (m+1, )} Further, if +1 = k, we construct F k+1, (m+1, k ) from F k, (m+1, k ) and add it to the collection of sketches. This step is possible since F satisfies Property P. delete: We compute V (m, 1) as follows: We perform a delete operation for the sketch F k, (m, ). All other sketches remain unchanged. This results in the collection of sketches: {F k, (m, 1), F k 1, (m, k 1 ),..., F, (m, )} Further, if 1 < k 1, we drop the sketch F k, (m, 1) from the collection of sketches. 7. Specific Constructions In this section we show that all the bounded-window sketches that we presented in Sections 5 and 6 satisfy Property P. emma 3. The bounded-window sketches C, Q and RQ satisfy Property P. Proof. We present the proof only for the bounded-window sketch C. The proofs for Q and RQ are very similar. In order to prove that C satisfies Property P, we need to show that C k+1,(m, k ) can be constructed from C k,(m, k ). Intuitively, this construction is feasible since C k,(m, k ) is a more accurate sketch than C k+1,(m, k ): Both sketches compute statistics over the same window, but C k+1,(m, k ) allows -approximate statistics to be computed over the current window, while C k+1,(m, k ) only allows -approximate statistics to be computed. For presentation clarity, we assume that no block of C k,(m, k ) is under-construction. This is equivalent to assuming that m is an exact multiple of k. The proof for the more general case uses essentially the same ideas as the proof presented here. et = log ( ) denote the number of different levels for blocks: this is the same for both C k+1,(m, k ) and C k,(m, k ), since is a function of alone. For l <, we can show that each level-l active block of C k+1,(m, k ) is the same as some level-(l + 1) active block of C k,(m, k ). For this common block, C k,(m, k ) stores a sketch with error parameter l+1, while C k+1,(m, k ) stores a sketch with error parameter l. By definition, l+1 = l. In general, for counts, we can show that a sketch for a bag of elements with error parameter can be constructed using a sketch for the same bag with error parameter. (This is shown formally in emma.) Therefore, for l <, we can construct a

9 k+1 k evel 3 evel evel 1 evel 0 evel 3 evel evel 1 evel 0 Figure : Construction of C k+1,(m, k ) from C k,(m, k ) sketch for a level-l active block of C k+1,(m, k ) using the sketch maintained for the same block by C k,(m, k ). (See Figure for an illustration.) We now consider level- blocks of C k+1,(m, k ). Since the current window size for C k+1,(m, k ) is k, and the size of a level- block in C k+1,(m, k ) is k+1 (by definition), there can be no active level- block in C k+1,(m, k ). We consider two possible cases depending on whether C k+1,(m, k ) contains a level- block under-construction or not. If C k+1,(m, k ) does not contain a level- block under -construction, we do not need to do anything further, since we have constructed a valid block-level sketch for every block of C k+1,(m, k ) that is either under-construction or active. In order to be able to handle the second case, where C k+1,(m, k ) contains a level- block under-construction, we make one minor modification to the running of the onepass Misra-Gries algorithm over blocks under-construction: When a block becomes active, we do not immediately terminate the algorithm and discard the sketches as described in Section 5.1, but do so only when the next block in the same level reaches under-construction state. (This change translates to slightly modifying the definitions of active and under-construction blocks, so that, at any point in time, there is always a block in each level in the under-construction state.) one of the results so far are affected by this modification. With this modification, an instance of the one-pass algorithm, capable of producing an -sketch is running over the elements of the single active level- block of C k,(m, k ). We simply continue this algorithm on behalf of the level- block of C k+1,(m, k ) that is under-construction, and after the next k elements arrive, this algorithm can be used to produce an -sketch. emma. For quantiles (resp. counts), a sketch with error parameter for a bag that uses O( 1 ) space can be obtained from a sketch for the same bag that has error param- eter. Proof. First consider quantiles. An -sketch can be constructed by probing sketch for quantiles φ =,,..., 1 and storing the returned approximate quantiles. this sketch uses O( 1 ) space. Clearly, For counts, retrieve all pairs e, f e in the -sketch, and store as part of the -sketch those pairs with f e M/, where M is the size of the bag. At most elements can have f e M/. (Recall that the approximate count f e the frequency). Therefore, size of the constructed sketch is O( 1 ). emma 5. The randomized bounded-window sketch RC satisfies Property P. Proof. (Sketch) We need to show that RC k+1,,δ(m, k ) can be constructed using RC k,,δ(m, k ). We can easily show that if RC k,,δ samples one element every r elements, RC k+1,,δ samples one element every r+1 elements. The construction of RC k+1,,δ(m, k ) from RC k,,δ(m, k ) essentially uses the fact that a 1-in- r+1 sample can be obtained from a 1-in- r sample. The theorems below follow directly from the general technique for constructing unbounded-window sketches and the space requirements for our bounded-window sketches. Theorem 5. There exists an unbounded-window sketch V for computing -approximate counts, such that V (m, ) requires O( 1 log 1 log ) space. Theorem 6. There exists an unbounded-window sketch V for computing -approximate quantiles, such that V (m, ) requires O( 1 log 1 log log ) space. Theorem 7. There exists an randomized, unboundedwindow sketch V,δ for computing -approximate counts with probability at least (1 δ), such that V,δ (m, ) requires O( 1 log(δ) 1 log ) space.

10 Theorem 8. There exists an randomized, unboundedwindow sketch V,δ for computing -approximate quantiles with probability at least (1 δ), such that V,δ (m, ) requires O( 1 log 1 log( 1 log 1 ) log ) space, where = (δ)/ log(1/). 8. EXTESIOS TO OTHER TYPES OF WIDOWS As we indicated in Section 1, fixed- and variable-size windows capture the essential features of many common types of windows. Tuple-based windows [] correspond exactly to fixed-size windows. A time-based window [] is defined for streams whose elements have associated timestamps, indicating their arrival time on the stream. At any given point in time, a time-based window contains the elements of a stream with timestamps in the last T time units, for some fixed parameter T. Since zero, one, or more than one elements can have the same timestamp, the number of elements in a time-based windows can vary. Therefore time-based windows are closely related to variable-size windows 5. All of our unbounded-window can be extended to handle time-based windows. Briefly, this is done by storing for each block the timestamp of the oldest element that belongs to the block. (Recall that all of our unbounded-window sketches are constructed using boundedwindow sketches, which in turn maintain some summary information at the level of blocks, for some definition of blocks.) We omit the details, which are straightforward. in et al [17] introduce the notion of n-of- window model. An algorithm maintaining statistics in the n-of- window model should be able to compute approximate statistics over the last n elements of the stream for any n. Any algorithm over variable-size windows necessarily has the ability to compute statistics in the n-of- window, since an adversary for variable-size windows can arbitrarily shrink a window. 9. ACKOWEDGMETS This work was supported by an SRC grant and by ational Science Foundation grants IIS , IIS , and EIA The authors thank Mayank Bawa for useful feedback on an initial draft. 10. REFERECES [1] R. Agrawal, T. Imielinski, and A.. Swami. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM SIGMOD Intl. Conf. on Management of Data, pages 07 16, May [] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of the 1st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 1 16, June 00. [3] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. 5 There exist statistics, e.g., the count of all elements in the window, that can be computed using constant space over variable-size windows, but not over time-based windows. of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms, pages , Jan. 00. [] B. Babcock, M. Datar, R. Motwani, and. O Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. of the nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 3 3, June 003. [5] M. Blum, R. W. Floyd, V. R. Pratt, R.. Rivest, and R. E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7():8 61, Aug [6] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. of the 9th Intl. Coll. on Automata, anguages and Programming, pages , July 00. [7] G. Cormode and S. Muthukrishnan. What s hot and what s not: tracking most frequent items dynamically. In Proc. of the nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages , June 003. [8] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. In ATI 00, pages 9 38, Apr. 00. [9] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 635 6, Jan. 00. [10] E. D. Demaine, A. ópez-ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. of the 10th Annual European Symp. on Algorithms, pages , Sept. 00. [11] D. J. DeWitt, J. F. aughton, and D. A. Schneider. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proc. of the 1st Intl. Conf. on Parallel and Distributed Information Systems, pages 80 91, Dec [1] R. W. Floyd and R.. Rivest. Expected time bounds for selection. Comm. of the ACM, 18(3):165 17, Mar [13] P. B. Gibbons and Y. Matias. ew sampling-based summary statistics for improving approximate query answers. In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 331 3, June [1] P. B. Gibbons and S. Tirthapura. Distributed streams algorithms for sliding windows. In Proc. of the 1th Annual ACM Symp. on Parallel Algs. and Architectures, pages 63 7, Aug. 00. [15] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. of the 001 ACM SIGMOD Intl. Conf. on Management of Data, pages 58 66, May 001. [16] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. on Database Systems, 8(1):51 55, Mar. 003.

11 [17] X. in, H. u, J. Xu, and J. X. Yu. Continuously maintaining quantile summaries of the most recent elements over a data stream. In Proc. of the 0th Intl. Conf. on Data Engineering, Mar. 00. (to appear). [18] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. of the 8th Intl. Conf. on Very arge Data Bases, pages , Aug. 00. [19] G. S. Manku, S. Rajagopalan, and B. G. indsay. Approximate medians and other quantiles in one pass and with limited memory. In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 6 35, June [0] G. S. Manku, S. Rajagopalan, and B. G. indsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 51 6, June [1] J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Programming, ():13 15, ov [] R. Motwani, J. Widom, et al. Query processing, approximation, and resource management in a data stream management system. In Proc. of the 1st Conf. on Innovative Data Systems Research, pages 5 56, Jan [3] J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, pages , [] M. Paterson. Progress in selection. In Proc. of the 5th Scandinavian Workshop on Algorithm Theory, pages , July [5] I. Pohl. A minimum storage algorithm for computing the median. Technical Report IBM Research Report RC 701 (# 1713), IBM T. J. Watson Center, [6] H. Toivonen. Sampling large databases for association rules. In Proc. of the nd Intl. Conf. on Very arge Data Bases, pages 13 15, Sept [7] J. S. Vitter. Random sampling with a reservoir. ACM Trans. on Mathematical Software, 11(1):37 57, Mar

### Identifying Frequent Items in Sliding Windows over On-Line Packet Streams

Identifying Frequent Items in Sliding Windows over On-Line Packet Streams (Extended Abstract) Lukasz Golab lgolab@uwaterloo.ca David DeHaan dedehaan@uwaterloo.ca Erik D. Demaine Lab. for Comp. Sci. M.I.T.

### Optimal Tracking of Distributed Heavy Hitters and Quantiles

Optimal Tracking of Distributed Heavy Hitters and Quantiles Ke Yi HKUST yike@cse.ust.hk Qin Zhang MADALGO Aarhus University qinzhang@cs.au.dk Abstract We consider the the problem of tracking heavy hitters

### A Note on Maximum Independent Sets in Rectangle Intersection Graphs

A Note on Maximum Independent Sets in Rectangle Intersection Graphs Timothy M. Chan School of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1, Canada tmchan@uwaterloo.ca September 12,

### Scalable Simple Random Sampling and Stratified Sampling

Xiangrui Meng LinkedIn Corporation, 2029 Stierlin Court, Mountain View, CA 94043, USA ximeng@linkedin.com Abstract Analyzing data sets of billions of records has now become a regular task in many companies

### Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream

Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream Xuemin Lin University of New South Wales Sydney, Australia Jian Xu University of New South Wales Sydney, Australia

### Detecting Click Fraud in Pay-Per-Click Streams of Online Advertising Networks

The 28th International Conference on Distributed Computing Systems Detecting Click Fraud in Pay-Per-Click Streams of Online Advertising Networks Linfeng Zhang and Yong Guan Department of Electrical and

### Summary Data Structures for Massive Data

Summary Data Structures for Massive Data Graham Cormode Department of Computer Science, University of Warwick graham@cormode.org Abstract. Prompted by the need to compute holistic properties of increasingly

### Security in Outsourcing of Association Rule Mining

Security in Outsourcing of Association Rule Mining W. K. Wong The University of Hong Kong wkwong2@cs.hku.hk David W. Cheung The University of Hong Kong dcheung@cs.hku.hk Ben Kao The University of Hong

### Maintaining Variance and k Medians over Data Stream Windows

Maintaining Variance and k Medians over Data Stream Windows Brian Babcock Mayur Datar Rajeev Motwani Liadan O Callaghan Department of Computer Science Stanford University Stanford, CA 94305 {babcock, datar,

### The Relative Worst Order Ratio for On-Line Algorithms

The Relative Worst Order Ratio for On-Line Algorithms Joan Boyar 1 and Lene M. Favrholdt 2 1 Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark, joan@imada.sdu.dk

### Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding

### The Conference Call Search Problem in Wireless Networks

The Conference Call Search Problem in Wireless Networks Leah Epstein 1, and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. lea@math.haifa.ac.il 2 Department of Statistics,

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

### Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important

### Approximate Frequency Counts over Data Streams

Approximate Frequency Counts over Data Streams Gurmeet Singh Manku Stanford University manku@cs.stanford.edu Rajeev Motwani Stanford University rajeev@cs.stanford.edu Abstract We present algorithms for

### CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms

### MODELING RANDOMNESS IN NETWORK TRAFFIC

MODELING RANDOMNESS IN NETWORK TRAFFIC - LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of

### Mining various patterns in sequential data in an SQL-like manner *

Mining various patterns in sequential data in an SQL-like manner * Marek Wojciechowski Poznan University of Technology, Institute of Computing Science, ul. Piotrowo 3a, 60-965 Poznan, Poland Marek.Wojciechowski@cs.put.poznan.pl

### Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints Michael Bauer, Srinivasan Ravichandran University of Wisconsin-Madison Department of Computer Sciences {bauer, srini}@cs.wisc.edu

### Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances

Bounded Cost Algorithms for Multivalued Consensus Using Binary Consensus Instances Jialin Zhang Tsinghua University zhanggl02@mails.tsinghua.edu.cn Wei Chen Microsoft Research Asia weic@microsoft.com Abstract

### Factoring & Primality

Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

### The Online Set Cover Problem

The Online Set Cover Problem Noga Alon Baruch Awerbuch Yossi Azar Niv Buchbinder Joseph Seffi Naor ABSTRACT Let X = {, 2,..., n} be a ground set of n elements, and let S be a family of subsets of X, S

### Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs

Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs Stavros Athanassopoulos, Ioannis Caragiannis, and Christos Kaklamanis Research Academic Computer Technology Institute

### Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets

Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets Gurmeet Singh Manku IBM Almaden Research Center manku@almaden.ibm.com Sridhar Rajagopalan IBM Almaden

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### Load Shedding for Aggregation Queries over Data Streams

Load Shedding for Aggregation Queries over Data Streams Brian Babcock Mayur Datar Rajeev Motwani Department of Computer Science Stanford University, Stanford, CA 94305 {babcock, datar, rajeev}@cs.stanford.edu

### Scheduling Real-time Tasks: Algorithms and Complexity

Scheduling Real-time Tasks: Algorithms and Complexity Sanjoy Baruah The University of North Carolina at Chapel Hill Email: baruah@cs.unc.edu Joël Goossens Université Libre de Bruxelles Email: joel.goossens@ulb.ac.be

### 4.4 The recursion-tree method

4.4 The recursion-tree method Let us see how a recursion tree would provide a good guess for the recurrence = 3 4 ) Start by nding an upper bound Floors and ceilings usually do not matter when solving

### CS 840 Fall 2016 Self Organizing Linear Search

CS 840 Fall 2016 Self Organizing Linear Search We will consider searching under the comparison model Binary tree lg n upper and lower bounds This also holds in average case Updates also in O(lg n) Linear

### Load Balancing and Switch Scheduling

EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

### The Classes P and NP

The Classes P and NP We now shift gears slightly and restrict our attention to the examination of two families of problems which are very important to computer scientists. These families constitute the

### Cloud Storage and Online Bin Packing

Cloud Storage and Online Bin Packing Doina Bein, Wolfgang Bein, and Swathi Venigella Abstract We study the problem of allocating memory of servers in a data center based on online requests for storage.

### COLORED GRAPHS AND THEIR PROPERTIES

COLORED GRAPHS AND THEIR PROPERTIES BEN STEVENS 1. Introduction This paper is concerned with the upper bound on the chromatic number for graphs of maximum vertex degree under three different sets of coloring

### 1.1 Related work For non-preemptive scheduling on uniformly related machines the rst algorithm with a constant competitive ratio was given by Aspnes e

A Lower Bound for On-Line Scheduling on Uniformly Related Machines Leah Epstein Jir Sgall September 23, 1999 lea@math.tau.ac.il, Department of Computer Science, Tel-Aviv University, Israel; sgall@math.cas.cz,

### Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),

### Finding and counting given length cycles

Finding and counting given length cycles Noga Alon Raphael Yuster Uri Zwick Abstract We present an assortment of methods for finding and counting simple cycles of a given length in directed and undirected

### arxiv:1112.0829v1 [math.pr] 5 Dec 2011

How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

### Divide-and-Conquer Algorithms Part Four

Divide-and-Conquer Algorithms Part Four Announcements Problem Set 2 due right now. Can submit by Monday at 2:15PM using one late period. Problem Set 3 out, due July 22. Play around with divide-and-conquer

### princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora Scribe: One of the running themes in this course is the notion of

### Frequency Estimation of Internet Packet Streams with Limited Space

Frequency Estimation of Internet Packet Streams with Limited Space Erik D. Demaine Alejandro López-Ortiz J. Ian Munro Abstract We consider a router on the Internet analyzing the statistical properties

### Fundamentals of Analyzing and Mining Data Streams

Fundamentals of Analyzing and Mining Data Streams Graham Cormode graham@research.att.com Outline 1. Streaming summaries, sketches and samples Motivating examples, applications and models Random sampling:

### Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Sorting revisited How did we use a binary search tree to sort an array of elements? Tree Sort Algorithm Given: An array of elements to sort 1. Build a binary search tree out of the elements 2. Traverse

### résumé de flux de données

résumé de flux de données CLEROT Fabrice fabrice.clerot@orange-ftgroup.com Orange Labs data streams, why bother? massive data is the talk of the town... data streams, why bother? service platform production

### Dynamic 3-sided Planar Range Queries with Expected Doubly Logarithmic Time

Dynamic 3-sided Planar Range Queries with Expected Doubly Logarithmic Time Gerth Stølting Brodal 1, Alexis C. Kaporis 2, Spyros Sioutas 3, Konstantinos Tsakalidis 1, Kostas Tsichlas 4 1 MADALGO, Department

### Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Universal hashing No matter how we choose our hash function, it is always possible to devise a set of keys that will hash to the same slot, making the hash scheme perform poorly. To circumvent this, we

### 2 Exercises and Solutions

2 Exercises and Solutions Most of the exercises below have solutions but you should try first to solve them. Each subsection with solutions is after the corresponding subsection with exercises. 2.1 Sorting

### Lecture 17 : Equivalence and Order Relations DRAFT

CS/Math 240: Introduction to Discrete Mathematics 3/31/2011 Lecture 17 : Equivalence and Order Relations Instructor: Dieter van Melkebeek Scribe: Dalibor Zelený DRAFT Last lecture we introduced the notion

### Exponential time algorithms for graph coloring

Exponential time algorithms for graph coloring Uriel Feige Lecture notes, March 14, 2011 1 Introduction Let [n] denote the set {1,..., k}. A k-labeling of vertices of a graph G(V, E) is a function V [k].

### Can you design an algorithm that searches for the maximum value in the list?

The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We

### Data Streams A Tutorial

Data Streams A Tutorial Nicole Schweikardt Goethe-Universität Frankfurt am Main DEIS 10: GI-Dagstuhl Seminar on Data Exchange, Integration, and Streams Schloss Dagstuhl, November 8, 2010 Data Streams Situation:

### Mean Ramsey-Turán numbers

Mean Ramsey-Turán numbers Raphael Yuster Department of Mathematics University of Haifa at Oranim Tivon 36006, Israel Abstract A ρ-mean coloring of a graph is a coloring of the edges such that the average

### Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nash at Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

### Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching

### CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-Art

Frequent Items in Streaming Data: An Experimental Evaluation of the State-of-the-Art Nishad Manerikar University of Trento nd.manerikar@studenti.unitn.it Themis Palpanas University of Trento themis@disi.unitn.eu

### WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? introduction Many students seem to have trouble with the notion of a mathematical proof. People that come to a course like Math 216, who certainly

### Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters

Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Fan Deng University of Alberta fandeng@cs.ualberta.ca Davood Rafiei University of Alberta drafiei@cs.ualberta.ca ABSTRACT

### Lecture 4 Online and streaming algorithms for clustering

CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

### Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

### Chapter 6: Episode discovery process

Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing

### Auctions for Digital Goods Review

Auctions for Digital Goods Review Rattapon Limprasittiporn Computer Science Department University of California Davis rlim@ucdavisedu Abstract The emergence of the Internet brings a large change in the

### JUST-IN-TIME SCHEDULING WITH PERIODIC TIME SLOTS. Received December May 12, 2003; revised February 5, 2004

Scientiae Mathematicae Japonicae Online, Vol. 10, (2004), 431 437 431 JUST-IN-TIME SCHEDULING WITH PERIODIC TIME SLOTS Ondřej Čepeka and Shao Chin Sung b Received December May 12, 2003; revised February

### I/O-Efficient Spatial Data Structures for Range Queries

I/O-Efficient Spatial Data Structures for Range Queries Lars Arge Kasper Green Larsen MADALGO, Department of Computer Science, Aarhus University, Denmark E-mail: large@madalgo.au.dk,larsen@madalgo.au.dk

### Regular Languages and Finite Automata

Regular Languages and Finite Automata 1 Introduction Hing Leung Department of Computer Science New Mexico State University Sep 16, 2010 In 1943, McCulloch and Pitts [4] published a pioneering work on a

### Two General Methods to Reduce Delay and Change of Enumeration Algorithms

ISSN 1346-5597 NII Technical Report Two General Methods to Reduce Delay and Change of Enumeration Algorithms Takeaki Uno NII-2003-004E Apr.2003 Two General Methods to Reduce Delay and Change of Enumeration

### Class constrained bin covering

Class constrained bin covering Leah Epstein Csanád Imreh Asaf Levin Abstract We study the following variant of the bin covering problem. We are given a set of unit sized items, where each item has a color

### The Open University s repository of research publications and other research outputs

Open Research Online The Open University s repository of research publications and other research outputs The degree-diameter problem for circulant graphs of degree 8 and 9 Journal Article How to cite:

### Lecture 10: CPA Encryption, MACs, Hash Functions. 2 Recap of last lecture - PRGs for one time pads

CS 7880 Graduate Cryptography October 15, 2015 Lecture 10: CPA Encryption, MACs, Hash Functions Lecturer: Daniel Wichs Scribe: Matthew Dippel 1 Topic Covered Chosen plaintext attack model of security MACs

### An example of a computable

An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

### The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints

### Discovery of Frequent Episodes in Event Sequences

Data Mining and Knowledge Discovery 1, 259 289 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Discovery of Frequent Episodes in Event Sequences HEIKKI MANNILA heikki.mannila@cs.helsinki.fi

### On the k-path cover problem for cacti

On the k-path cover problem for cacti Zemin Jin and Xueliang Li Center for Combinatorics and LPMC Nankai University Tianjin 300071, P.R. China zeminjin@eyou.com, x.li@eyou.com Abstract In this paper we

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### Equilibria in Online Games

Equilibria in Online Games Roee Engelberg Joseph (Seffi) Naor Abstract We initiate the study of scenarios that combine online decision making with interaction between noncooperative agents To this end

### MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan

### Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs

Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs Yong Zhang 1.2, Francis Y.L. Chin 2, and Hing-Fung Ting 2 1 College of Mathematics and Computer Science, Hebei University,

### International Journal of Information Technology, Modeling and Computing (IJITMC) Vol.1, No.3,August 2013

FACTORING CRYPTOSYSTEM MODULI WHEN THE CO-FACTORS DIFFERENCE IS BOUNDED Omar Akchiche 1 and Omar Khadir 2 1,2 Laboratory of Mathematics, Cryptography and Mechanics, Fstm, University of Hassan II Mohammedia-Casablanca,

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### An Efficient Algorithm for Measuring Medium- to Large-sized Flows in Network Traffic

An Efficient Algorithm for Measuring Medium- to Large-sized Flows in Network Traffic Ashwin Lall Georgia Inst. of Technology Mitsunori Ogihara University of Miami Jun (Jim) Xu Georgia Inst. of Technology

### SECTION 2.5: FINDING ZEROS OF POLYNOMIAL FUNCTIONS

SECTION 2.5: FINDING ZEROS OF POLYNOMIAL FUNCTIONS Assume f ( x) is a nonconstant polynomial with real coefficients written in standard form. PART A: TECHNIQUES WE HAVE ALREADY SEEN Refer to: Notes 1.31

### Sublinear Algorithms for Big Data. Part 4: Random Topics

Sublinear Algorithms for Big Data Part : Random Topics Qin Zhang 1-1 Topic 3: Random sampling in distributed data streams (based on a paper with ormode, Muthukrishnan and Yi, PODS 10, JAM 12) 2-1 Distributed

### THE SCHEDULING OF MAINTENANCE SERVICE

THE SCHEDULING OF MAINTENANCE SERVICE Shoshana Anily Celia A. Glass Refael Hassin Abstract We study a discrete problem of scheduling activities of several types under the constraint that at most a single

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### Lecture 1: Elementary Number Theory

Lecture 1: Elementary Number Theory The integers are the simplest and most fundamental objects in discrete mathematics. All calculations by computers are based on the arithmetical operations with integers

### 1 Approximating Set Cover

CS 05: Algorithms (Grad) Feb 2-24, 2005 Approximating Set Cover. Definition An Instance (X, F ) of the set-covering problem consists of a finite set X and a family F of subset of X, such that every elemennt

### 2.3 Scheduling jobs on identical parallel machines

2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed

### On strong fairness in UNITY

On strong fairness in UNITY H.P.Gumm, D.Zhukov Fachbereich Mathematik und Informatik Philipps Universität Marburg {gumm,shukov}@mathematik.uni-marburg.de Abstract. In [6] Tsay and Bagrodia present a correct

### Multimedia Communications. Huffman Coding

Multimedia Communications Huffman Coding Optimal codes Suppose that a i -> w i C + is an encoding scheme for a source alphabet A={a 1,, a N }. Suppose that the source letter a 1,, a N occur with relative

### Why? A central concept in Computer Science. Algorithms are ubiquitous.

Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online

### Duplicate Detection in Click Streams

Duplicate Detection in Click Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science, University of California at Santa Barbara Santa Barbara CA 93106 {metwally, agrawal,

### Week 5 Integral Polyhedra

Week 5 Integral Polyhedra We have seen some examples 1 of linear programming formulation that are integral, meaning that every basic feasible solution is an integral vector. This week we develop a theory

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### Dynamic TCP Acknowledgement: Penalizing Long Delays

Dynamic TCP Acknowledgement: Penalizing Long Delays Karousatou Christina Network Algorithms June 8, 2010 Karousatou Christina (Network Algorithms) Dynamic TCP Acknowledgement June 8, 2010 1 / 63 Layout

### Chapters 1 and 5.2 of GT

Subject 2 Spring 2014 Growth of Functions and Recurrence Relations Handouts 3 and 4 contain more examples;the Formulae Collection handout might also help. Chapters 1 and 5.2 of GT Disclaimer: These abbreviated

### Partitioning edge-coloured complete graphs into monochromatic cycles and paths

arxiv:1205.5492v1 [math.co] 24 May 2012 Partitioning edge-coloured complete graphs into monochromatic cycles and paths Alexey Pokrovskiy Departement of Mathematics, London School of Economics and Political

### Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

### An Evaluation of Self-adjusting Binary Search Tree Techniques

SOFTWARE PRACTICE AND EXPERIENCE, VOL. 23(4), 369 382 (APRIL 1993) An Evaluation of Self-adjusting Binary Search Tree Techniques jim bell and gopal gupta Department of Computer Science, James Cook University,