Sampling Time-Based Sliding Windows in Bounded Space

Transcription

Rainer Gemulla, Technische Universität Dresden, Dresden, Germany
Wolfgang Lehner, Technische Universität Dresden, Dresden, Germany

ABSTRACT

Random sampling is an appealing approach to build synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so that the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce each sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS quickly adapts to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.
Categories and Subject Descriptors
H.2 [Database Management]: Miscellaneous; G.3 [Probability and Statistics]: Probabilistic algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'08, June 9-12, 2008, Vancouver, BC, Canada. Copyright 2008 ACM /08/06...$5.00.

General Terms
Algorithms, Theory

1. INTRODUCTION

Random sampling techniques are at the heart of every data stream management system. In fact, it is often infeasible to process and/or store the entire data stream in high-speed applications like monitoring of sensor data, network traffic, or transaction logs. Random sampling is an appealing approach to build synopses of large data streams, since most analytical tasks can be executed on a sample either directly or in a slightly modified fashion. For example, random samples can be used to estimate sums, averages and quantiles, but they also support complex data mining tasks such as clustering.

The main challenge in data stream sampling is to maintain samples that represent only the part of the data stream relevant for analysis. Since many analytical tasks focus on only a recent part of the data stream, it is often unnecessary or infeasible to maintain a sample of the data stream in its entirety [1]. Current research addresses this problem by maintaining so-called sequence-based (SB) samples. In an SB sampling scheme, the probability of an item being included in the sample depends only on its position in the data stream. Popular variants are the maintenance of a uniform sample of the k most recent items [2, 9] or biased sampling schemes in which the inclusion probability of an item decays as new items arrive [9, 1]. In general, SB schemes provide for efficient sample maintenance in bounded space.
Their disadvantage, however, is that they are not well-suited for time-based analysis. To see this, consider the following simple CQL query:

    SELECT SUM(size) AS num_bytes FROM packets [Range 60 Minutes]

This query monitors the number of bytes observed in the packets stream during the last 60 minutes. Suppose that we want to answer the above query in a continuous fashion by maintaining an SB sample of the k most recent items. Since the sample is sequence-based and the query is time-based, we have to choose k in such a way that the last k items are guaranteed to completely cover the 60-minute range of the query. Clearly, such a choice is impossible if no a-priori knowledge about the data stream is available. But even if we can come up with an upper bound for the number of items in the query range, SB schemes may perform poorly in practice. The reason is that, unless the data stream rate is roughly constant, the average window size is much smaller than the

upper bound, so that (with high probability) the sample contains a large fraction of outdated items not relevant for the query.

The above problems are addressed by time-based (TB) sampling schemes [2, 9]. In such a scheme, the inclusion probability of an item depends on its timestamp rather than on its position in the data stream. In this paper, we focus on sampling schemes that maintain a random sample of a time-based sliding window. For example, a random sample of the packets that arrived during the last 60 minutes can directly be used to approximately answer the query above.

In TB sampling, the main challenge is to realize an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. Space bounds are crucial in data stream management systems with many samples maintained in parallel, since they greatly simplify the task of memory management and avoid unexpected memory shortcomings at runtime. The difficulty of TB sampling arises from the fact that the number of items in the window may vary significantly over time. If the sampling fraction is kept constant at all times, the sample shrinks and grows with the number of items in the window. The size of the sample is therefore unstable and unbounded. To avoid such behavior, the sampling fraction has to be adapted on-the-fly. Intuitively, we can afford a large sampling fraction at low stream rates but only a low sampling fraction at high rates.

Uniform sampling schemes, which produce each sample of the same size with equal probability, are the most general of the available sampling schemes. Uniform samples are well understood and, in fact, many statistical estimators require an underlying uniform sample. In the context of data stream systems, uniform sampling is applied either to the data sources directly or to the output of some of the operators in the query graph. It is used to reduce or bound resource consumption, to support ad-hoc querying, to analyze why the system produced a specific output or to optimize the query graph.
Though more efficient techniques exist for some of these applications, uniformity is a must if knowledge about the intended use of the sample is not available at the time the sample is created. This is because uniform samples are not tailored to a specific set of queries but provide a versatile synopsis of the underlying data.

In this paper, we are concerned with sampling schemes that maintain a uniform sample of a time-based sliding window in bounded space. We show that no such sampling scheme can provide hard sample size guarantees. Thus, any bounded-space uniform scheme produces samples of variable size, and sample size guarantees are, if available, probabilistic. We then introduce a novel uniform sampling scheme called bounded priority sampling (BPS). The scheme is based on priority sampling [2] but requires only bounded space. To the best of our knowledge, BPS is the first bounded-space uniform sampling scheme for time-based sliding windows. We analyze the distribution of the sample size and show that the algorithm quickly adapts to changing data rates. By leveraging results from the area of distinct-value estimation [3], we also show how the number of items in the window can be estimated from the sample.

For applications where the uniformity requirement is not crucial, we introduce a more space-efficient stratified sampling scheme [9]. In stratified sampling, the window is partitioned into non-overlapping time intervals called strata. For each stratum, a uniform sample is maintained. Stratified samples are easier to maintain than uniform samples, but estimation becomes more involved. For example, it is not known how to estimate the number of distinct values of an attribute from a stratified sample. If stratified samples can be used, however, estimates may become more precise [9]. In this paper, we discuss the problem of both placing stratum boundaries and maintaining the corresponding samples. We develop a merge-based stratified scheme (MBS), which maintains strata of approximately equal size.
The algorithm merges adjacent strata from time to time; the decision of when to merge and which strata to merge is a major contribution of this paper. In our solution, we treat the problem as an optimization problem and give a dynamic programming algorithm to determine the optimum stratum boundaries.

The remainder of the paper is structured as follows. In Section 2, we review existing techniques from the database and data stream literature. In Section 3, we discuss the problem of maintaining a uniform sample of a time-based sliding window and give the details of the BPS algorithm. Section 4 introduces merge-based stratification, which is then used to maintain strata of approximately equal size. We present the results of our experimental evaluation in Section 5 and conclude the paper in Section 6.

2. EXISTING TECHNIQUES

In this section, we review existing sampling techniques and discuss their applicability for the setting of time-based sliding windows. Since the focus of this paper is on bounded-space sampling schemes, we also analyze the sample size and space consumption of the available schemes.

Database sampling. A variety of sample maintenance techniques have been proposed in the context of relational database systems. The most popular database sampling technique is the reservoir sampling scheme [15]. The scheme maintains a uniform random sample of size $k$ of an insertion-only dataset by intercepting insertion requests on their way to the dataset. The idea is to add the first $k$ inserted items directly to the sample. Subsequent items are accepted into the sample with probability $k/(N+1)$, where $N$ is the number of items processed so far, or ignored otherwise. Accepted items replace a sample item chosen uniformly at random. Reservoir sampling has been extended to support updates and deletions [8, 7], so that one might consider using it for time-based sliding windows. The idea is to treat each arrival as an insertion into the window and each expiration as a deletion from the window. This approach does not work, however, since deletions are explicit in relational databases but implicit in time-based sliding windows.
That means that the database sampling schemes need to be aware of every deletion, whether the deleted item is sampled or not, while a window sampling scheme only observes the expiration of sampled items. For this reason, none of the available database sampling schemes can be applied to sliding windows.

Sequence-based sampling. In a sequence-based sampling scheme, the probability of an item being included in the sample depends only on its position in the data stream. Babcock et al. [2] discuss several sampling schemes that maintain a uniform random sample of the last $N$ elements of the stream. In [9], a stratified sampling scheme for the same purpose is given. The idea is to partition the data stream into a set of equally-sized strata and to maintain a reservoir sample

of each non-expired stratum. In Section 4, we apply this idea to time-based sliding windows; the key difference to [9] is that the determination of stratum boundaries becomes much harder. An alternative approach to focus attention on the recent items is to maintain a biased sample [9, 1]. In these schemes, the probability of an item being sampled decays as new items arrive. Again, this sequence-based notion of recency does not match the time-based notion of analysis, so that sequence-based schemes can only be used if a-priori knowledge about the stream is available.

Bernoulli sampling. In [2], a modified version of Bernoulli sampling, which maintains a uniform sample of a sequence-based window, has been proposed. The method can easily be adapted to the setting of time-based windows. Let $q \in (0, 1)$ be the desired sampling rate. In the adapted scheme, each item is included into the sample with probability $q$ independent of the other items and excluded with probability $1 - q$. Items are removed from the sample if and only if they expire. Suppose that at some arbitrary point in time, the sliding window contains $N$ items. Then, the expected sample size is $qN$ and the actual sample size is close to $qN$ with high probability. The size of the sample therefore grows and shrinks with the number of items in the sliding window. One might hope that it is possible to decrease (increase) $q$ dynamically whenever the sample size gets too large (too small). However, it has been shown recently that such a modification of $q$ destroys the uniformity of the sample [7], so that it is impossible to control the size of a Bernoulli sample.

Priority sampling. The priority sampling scheme [2] maintains a uniform sample of size 1 from a time-based sliding window. Larger samples can be obtained by running multiple priority samplers in parallel. The idea is to assign a random priority between 0 and 1 to each arriving item. At any time, the algorithm reports the item with highest priority in the window as the sample item. Since each item has the same probability of having the highest priority, the scheme is indeed uniform. [Footnote 1]
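As an illustration of priority sampling, the following is a minimal Python sketch, not the authors' implementation. It assumes that only those items are kept for which no later item has a higher priority, so the stored priorities decrease from the oldest to the newest item. The class name `PrioritySampler`, its methods and the `rng` parameter are hypothetical names chosen for this example.

```python
import random
from collections import deque


class PrioritySampler:
    """Size-1 uniform sample of a time-based sliding window (sketch)."""

    def __init__(self, window_length, rng=None):
        self.window_length = window_length
        self.rng = rng or random.Random()
        # Entries are (timestamp, priority, data). Timestamps increase and
        # priorities decrease from left to right, so the leftmost entry is
        # always the highest-priority item that is still stored.
        self.items = deque()

    def _expire(self, now):
        # An item with timestamp ts is in the window while now - length < ts.
        while self.items and self.items[0][0] <= now - self.window_length:
            self.items.popleft()

    def arrive(self, timestamp, data):
        self._expire(timestamp)
        priority = self.rng.random()
        # Older items with a lower priority can never be reported again.
        while self.items and self.items[-1][1] < priority:
            self.items.pop()
        self.items.append((timestamp, priority, data))

    def sample(self, now):
        self._expire(now)
        return self.items[0][2] if self.items else None
```

The length of `items` is not bounded by a constant in this sketch, which is exactly the space problem the paper sets out to remove.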
[Footnote 1: This concept is also known as min-wise sampling [11].]

In order to be able to always report the highest-priority item, it is both necessary and sufficient to store the items for which there is no element with both a larger timestamp and a higher priority. If, as above, the window contains $N$ items at an arbitrary point in time, the scheme requires $O(\log N)$ space in expectation; the actual space requirement is also $O(\log N)$ with high probability [2]. Thus, the space consumption of priority sampling cannot be bounded from above.

To summarize, none of the available sampling schemes can be used to maintain a random sample from a time-based sliding window in bounded space. We address this situation and introduce both a uniform and a stratified bounded-space sampling scheme.

3. UNIFORM SAMPLING

We model a data stream as an infinite sequence $R = (e_1, e_2, \ldots)$ of items. Each item $e_i$ has the form $(t_i, d_i)$, where $t_i \in \mathbb{R}$ denotes a timestamp and $d_i \in D$ denotes the data associated with the item. The data domain $D$ depends on the application; for example, $D$ might correspond to a finite set of IP addresses or an infinite set of readings from one or more sensors. Throughout the paper, we assume that $t_i < t_j$ for $i < j$, that is, the timestamps of the items are strictly increasing. [Footnote 2] Denote by $R(t)$ the set of items from $R$ with a timestamp smaller than or equal to $t$. Denote by $W_\Delta(t) = R(t) \setminus R(t - \Delta)$ a sliding window of length $\Delta$ and denote by $N_\Delta(t) = |W_\Delta(t)|$ the size of the window at time $t$. For brevity, we will suppress the subscript $\Delta$ in the following. Note that we use the term window length to refer to the timespan covered by the window ($\Delta$, fixed) and the term window size to refer to the number of items in the window ($N(t)$, varying).

In this section, we study the problem of maintaining a uniform random sample from $W(t)$ in bounded space. We consider sampling schemes that maintain a data structure from which a uniform random sample $S(t)$ of the items in $W(t)$ can be extracted at any time. The distinction between data structure and sample allows us to examine the space consumption and the sample size separately.
A sampling scheme is called uniform if for any $A_1, A_2 \subseteq W(t)$ with $|A_1| = |A_2|$ the probability $P\{S(t) = A_1\}$ that the scheme produces $A_1$ satisfies $P\{S(t) = A_1\} = P\{S(t) = A_2\}$. Thus, the probability that a sampling scheme produces $A_1$ depends only on $|A_1|$ and not on its composition.

[Footnote 2: The algorithms in this paper also work when $t_i \le t_j$ for $i < j$, but we will use the stronger assumption $t_i < t_j$ for expository reasons.]

3.1 A Negative Result

One might hope that there is a sampling scheme that is able to maintain a fixed-size uniform sample in bounded space. However, such a scheme does not exist.

Theorem 1. Fix some time $t$ and set $N = N(t)$. Then, any algorithm that maintains a fixed-size uniform random sample of size $k$ requires at least $\Omega(k \log N)$ space in expectation.

Proof. Let $A$ be an algorithm that maintains a uniform size-$k$ sample of a time-based sliding window and denote by $W(t) = \{e_{m+1}, \ldots, e_{m+N}\}$ the items in the window at time $t$. Furthermore, denote by $t_j = t_{m+j} + \Delta$ the point in time when item $e_{m+j}$ expires, $1 \le j < N$, and set $t_0 = t$. Now, consider the case where no new items arrive in the stream until all the $N$ items have expired. Then, let $I_j$ be a 0/1-random variable and set $I_j = 1$ if the sample reported by $A$ at time $t_j$ contains item $e_{m+j+1}$, the oldest item that is still in the window at that time. Otherwise, set $I_j = 0$. Since $A$ has to store all items it eventually reports, it follows that at time $t_0$ $A$ stores at least $X = \sum_j I_j$ items. We have to show that $E[X] = \Omega(k \log N)$. Since $A$ is a uniform sampling scheme, item $e_{m+1}$ is reported at time $t_0$ with probability $k/N$. At time $t_1$, only $N - 1$ items remain in the window and item $e_{m+2}$ is reported with probability $k/(N - 1)$. The argument can be repeated until, at time $t_{N-k}$, all the $k$ remaining items are reported by $A$. It follows that

$$P\{I_j = 1\} = \begin{cases} k/(N - j) & 0 \le j < N - k \\ 1 & \text{otherwise} \end{cases} \tag{1}$$

for $0 \le j < N$. Note that only the marginal probabilities are given in (1); joint probabilities like $P\{I_1 = 1, I_2 = 1\}$

depend on the internals of $A$. By the linearity of expected value, and since $E[I_j] = P\{I_j = 1\}$, we find that

$$E[X] = \sum_{j=0}^{N-1} E[I_j] = k(H_N - H_k + 1) = \Omega(k \ln N),$$

where $H_n = \sum_{i=1}^{n} 1/i = O(\ln n)$ denotes the $n$th harmonic number.

It follows directly that it is impossible to maintain a fixed-size uniform random sample from a time-based sliding window in bounded space. By Theorem 1, such maintenance requires expected space logarithmic in the window size (which is unbounded); the worst-case space consumption is at least as large. It is not possible either to guarantee a minimum sample size because any algorithm that guarantees a minimum sample size can be used to maintain a sample of size 1. In the light of Theorem 1, we also note that the priority sampling scheme (see Section 2) is asymptotically optimal in terms of expected space. However, the algorithm has a multiplicative overhead of $\ln N$ and therefore a low space efficiency.

[Figure 1: Illustration of PS (above timeline) and BPS (below timeline). Legend: sample item, replacement set, candidate item, test item.]

3.2 Bounding the Space Consumption

We now develop a bounded-space uniform sampling scheme based on priority sampling (PS). Recall that in priority sampling, a random priority $p_i$ chosen uniformly at random from the unit interval is associated with each item $e_i \in R$. The sample $S(t)$ then consists of the item in $W(t)$ with the largest priority. In addition to the sample item, the scheme stores a set of replacement items, which replace the largest-priority item when it expires. This replacement set consists of all the items for which there is no item with both a larger timestamp and a higher priority.

Figure 1 gives an example of the sampling process. A solid black circle represents the arrival of an item; its name and priority are given below and above, respectively. The vertical bars on the timeline indicate the window length, item expirations are indicated by white circles, and double-expirations [Footnote 3] are dotted white circles.
Above the timeline, the current sample item and the set of replacement items are shown. It can be seen that the number of replacement items stored by the algorithm varies over time. In fact, the replacement set is the reason for the unbounded space consumption of the sampling scheme: it contains between 0 and $N(t) - 1$ items and roughly $\ln N(t)$ items on average [2].

[Footnote 3: An item that arrived at time $t$ double-expires at time $t + 2\Delta$.]

We now describe our bounded-space priority sampling (BPS) scheme. The scheme also assigns random priorities to arriving items but stores at most two items in memory: a candidate item from $W(t)$ and a test item from $W(t - \Delta)$. The test item is used to determine whether or not the candidate item is reported as a sample item, see the discussion below. The maintenance of these two items is as follows:

a) Arrival of item $e_i$. If there is currently no candidate item or if the priority of $e_i$ is larger than the priority of the candidate item, $e_i$ becomes the new candidate item and the old candidate is discarded. Otherwise, the arriving item is ignored.

b) Expiration of candidate item. The expired candidate becomes the test item; we only store the timestamp and the priority of the test item. There is no candidate item until the next item arrives in the stream.

c) Double-expiration of test item. The test item is discarded.

The above algorithm maintains the following invariant: the candidate item always equals the highest-priority item that has arrived in the stream since the expiration of the former candidate item. This might or might not coincide with the highest-priority item in the current window and we use the test item to distinguish between these two cases. Suppose that at some time, the candidate item expires and becomes the test item. Then the candidate must have been the highest-priority item in the window right before its expiration. (If there were an item with a higher priority, this item would have replaced the candidate.) It follows that whenever the candidate item has a higher priority than the current test item, we know that the candidate is the highest-priority item since the arrival of the test item and therefore since the start of the current window.
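The maintenance rules a), b) and c) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: expirations are processed lazily whenever the structure is touched, and the candidate is reported only when no test item is stored or when the candidate has the higher priority. The class name, the method names and the optional `priority` argument (added here so that an example can fix the priorities) are hypothetical.

```python
import random


class BoundedPrioritySampler:
    """BPS sketch: stores at most one candidate item and one test item."""

    def __init__(self, window_length, rng=None):
        self.window_length = window_length
        self.rng = rng or random.Random()
        self.candidate = None  # (timestamp, priority, data)
        self.test = None       # (timestamp, priority) only

    def _advance(self, now):
        d = self.window_length
        # b) An expired candidate becomes the test item.
        if self.candidate is not None and self.candidate[0] <= now - d:
            self.test = self.candidate[:2]
            self.candidate = None
        # c) A double-expired test item is discarded.
        if self.test is not None and self.test[0] <= now - 2 * d:
            self.test = None

    def arrive(self, timestamp, data, priority=None):
        self._advance(timestamp)
        if priority is None:
            priority = self.rng.random()
        # a) Keep the arriving item only if it beats the current candidate.
        if self.candidate is None or priority > self.candidate[1]:
            self.candidate = (timestamp, priority, data)

    def sample(self, now):
        """Return the sample item, or None if no item can be reported."""
        self._advance(now)
        if self.candidate is None:
            return None
        if self.test is None or self.candidate[1] > self.test[1]:
            return self.candidate[2]
        return None
```

With a window length of 10 and hand-picked priorities, `sample` returns an item while the candidate is known to be the window maximum and `None` while a stored test item has the higher priority.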
Similarly, whenever there is no test item stored by BPS, there has not been an expiration of a candidate item for at least one window length, so that the candidate also equals the highest-priority item in the window. In both cases, we report the candidate as a sample item. Otherwise, if the candidate item has a lower priority than the test item, we have no means to detect whether or not the candidate equals the highest-priority item in the window and no sample item is reported.

Before we assert the correctness of BPS and analyze its properties, we give an example of the sampling process in Figure 1. The current candidate item and test item are shown below the timeline. If the candidate item is shaded, it is reported as a sample item; otherwise, no sample item is reported. The letters below the BPS data structure refer to cases a), b) and c) above. As long as no expiration occurs, the candidate stored by BPS equals the highest-priority item in the window and is therefore reported as a sample item. The situation changes as B expires. BPS then makes item B the test item and, because there is no candidate item anymore, fails to report a sample item. This failure can be seen as a consequence of Theorem 1: BPS is a bounded-space sampling scheme and thus cannot guarantee a fixed sample size. Item F becomes the new candidate item upon its arrival. However, F is not reported because its priority is lower than the priority of the test item B. And in fact, not F but C is the highest-priority item in the window at this time. Later C expires and F does become the highest-priority item in the window. However, we still do not report

F since we are not aware of this situation. As G arrives, however, we report a sample item again because G has a higher priority than the test item B. Finally, item B is discarded from the BPS data structure as it double-expires.

3.3 Correctness and Analysis

We now establish the correctness of the BPS algorithm. Recall that BPS produces either an empty sample or a single-item sample. Given that BPS does produce a sample item, we have to show that this item is chosen uniformly at random from the items in the current sliding window.

Theorem 2. BPS is a uniform sampling scheme, that is, for any $e_j \in W(t)$, we have $P\{S(t) = \{e_j\} \mid |S(t)| = 1\} = 1/N(t)$.

Proof. Fix some time $t$ and set $S = S(t)$. Denote by $e_{\max}$ the highest-priority item in $W(t)$ and suppose that $e_{\max}$ has priority $p_{\max}$. Furthermore, denote by $e' \in W(t - \Delta)$ the candidate item stored in the BPS data structure at time $t - \Delta$ (if there is one) and let $p'$ be the priority of $e'$. Note that both $e_{\max}$ and $e'$ are random variables. There are 3 cases.

Case 1: There is no candidate item at time $t - \Delta$. Then at time $t$, $e_{\max}$ is the candidate item and there is no test item. We have $S = \{e_{\max}\}$.

Case 2: Item $e'$ has a smaller priority than $e_{\max}$. Then $e_{\max}$ is the candidate item at time $t$ and, depending on whether $e'$ expired before or after the arrival of $e_{\max}$, the test item is either equal to $e'$ or empty. In both cases, we have $S = \{e_{\max}\}$.

Case 3: Item $e'$ has a higher priority than $e_{\max}$. Then, $e'$ is still the candidate item at the time of its expiration, since there is no higher-priority item in $W(t)$ that might have replaced $e'$. Thus, item $e'$ becomes the test item upon its expiration and continues to be the test item up to time $t$ (it double-expires somewhere in the interval $(t, t + \Delta)$). It follows that no item is reported at time $t$ so that $S = \emptyset$, because the priority of the candidate item ($\le p_{\max}$) is lower than the priority $p'$ of the test item.

To summarize, we have

$$S = \begin{cases} \{e_{\max}\} & \text{no candidate item at time } t - \Delta \\ \{e_{\max}\} & p_{\max} > p' \\ \emptyset & \text{otherwise.} \end{cases} \tag{2}$$

Uniformity now follows since (2) does not depend on the values, timestamps or order of the individual items in $W(t)$.
For any $e_j \in W(t)$, we have $P\{S = \{e_j\} \mid |S| = 1\} = P\{e_j = e_{\max}\} = 1/N(t)$ and the theorem follows.

We now analyze the sample size of the BPS scheme. Clearly, the sample size is probabilistic and its exact distribution depends on the entire history of the data stream. However, in the light of Theorem 3 below, it becomes evident that we can still provide a local lower bound on the probability that the scheme produces a sample item. The lower bound is local because it changes over time; we cannot guarantee a global lower bound other than 0 that holds at any arbitrary time without a-priori knowledge of the data stream.

Theorem 3. The probability that BPS succeeds in producing a sample item at time $t$ is bounded from below by

$$P\{|S(t)| = 1\} \ge \frac{N(t)}{N(t - \Delta) + N(t)}.$$

Proof. BPS produces a sample item if the highest-priority item $e_{\max} \in W(t)$ has a higher priority than the candidate item $e'$ stored in the BPS data structure right before the start of $W(t)$; see (2) above. In the worst case, $e'$ equals the highest-priority item in $W(t - \Delta)$. Now suppose that we order the items in $W(t - \Delta) \cup W(t)$ in descending order of their priorities. BPS succeeds for sure if the first of the ordered items is an element of $W(t)$. Since the priorities are independent and identically distributed, this event occurs with probability $N(t)/(N(t - \Delta) + N(t))$ and the assertion of the theorem follows.

If the arrival rate of the items in the data stream is constant so that $N(t) = N(t - \Delta)$, BPS succeeds with probability of at least 50%. If the rate increases or decreases, the success probability will also increase or decrease, respectively.

3.4 Sampling Multiple Items

The BPS scheme as given above can be used to maintain a single-item sample. A straightforward way to obtain larger samples is to run $k$ independent BPS samplers $S_1, \ldots, S_k$ in parallel; we refer to this scheme as BPS with replacement (BPSWR). The sample is then set to $S = S_1 \uplus \cdots \uplus S_k$. We have

$$E[|S|] = \sum_{i=1}^{k} P\{|S_i| = 1\} \ge k \, \frac{N(t)}{N(t - \Delta) + N(t)}$$

by the linearity of the expected value. However, this approach has two major drawbacks.
First, the sample $S$ is a with-replacement sample, that is, each item in the window may be sampled more than once. The net sample size after duplicate removal might therefore be smaller than $|S|$. Second and more importantly, the maintenance of the $k$ independent samples is expensive. Since a single copy of the BPS data structure requires constant time per arriving item, the per-item processing time is $O(k)$ and the total time to process a window of size $N$ is $O(kN)$. If $k$ is large, the overhead to maintain the sample can be significant.

We now develop a without-replacement sampling scheme called BPSWOR. In general, without-replacement samples are preferable since they contain more information about the data. The scheme is as follows: we modify BPS so as to store $k$ candidates and $k$ test items simultaneously. Denote by $S_{cand}$ the set of candidates and by $S_{test}$ the set of test items. The sampling process is similar to BPS: an arriving item $e$ becomes a candidate when either $|S_{cand}| < k$ or $e$ has a higher priority than the lowest-priority item in $S_{cand}$. In the latter case, the lowest-priority item is discarded in favor of $e$. As before, expiring candidates become test items and double-expiring test items are discarded. The sample $S(t)$ is then given by

$$S(t) = \text{top-}k\big(S_{cand}(t) \cup S_{test}(t)\big) \cap S_{cand}(t),$$

where top-$k(A)$ determines the items in $A$ with the $k$ highest priorities. Note that for $k = 1$, BPSWR and BPSWOR coincide. $S(t)$ is then a uniform random sample of $W(t)$ without replacement; the proof is similar to the proof of

Theorem 2. Also, using an argument as in the proof of Theorem 3, we can show that $E[|S(t)|] \ge kN(t)/(N(t - \Delta) + N(t))$. Thus, BPSWR and BPSWOR have the same lower bound on the expected (gross) sample size.

The cost of processing a window of size $N$ is $O(kN)$ if the candidates are stored in a simple array. A more efficient approach, which also improves the cost in comparison to BPSWR, is to store the candidates in a treap, where the items are arranged in order with respect to the timestamps and in heap-order with respect to the priorities. The expected cost of BPSWOR then decreases to $O(N + k \log k \log N)$. [Footnote 4]

Note that we can also modify PS to sample without replacement. The so-modified PSWOR scheme then reports the items with the $k$ highest priorities in the window. In order to maintain these $k$ items incrementally, we store each item as long as there are fewer than $k$ more recent items with a higher priority. The space consumption is still $O(k \log N)$ in expectation, but efficient maintenance of the replacement set becomes challenging. Since the focus of this paper is on bounded-space sampling schemes, we do not further elaborate on this issue.

3.5 Estimation of Window Size

For some applications, it is important to be able to estimate the window size in order to make effective use of the sample. For example, the window sum of an attribute is typically estimated as the sample average of the respective attribute multiplied by the window size. Thus, in some applications, knowledge of the window size is important to determine scale-up factors.

Exact maintenance of the number of items in the window requires that we store all the timestamps in the window in order to deal with expirations. Typically, this approach is infeasible in practice. Approximate data structures [6] do exist and can be leveraged to support the sampling process. If such alternate data structures are unavailable, we can come up with an estimate of the window size directly from the sample. Set $W_2(t) = W(t - \Delta) \cup W(t)$ and denote by $p_{(k)}$ the priority of the item with the $k$th highest priority in $W_2(t)$.
In [3], it has been shown that an unbiased estimator for $N(t)$ is given by

$$\hat{N}_W(t) = \frac{|W(t) \cap \text{top-}k(W_2(t))|}{k} \cdot \frac{k - 1}{1 - p_{(k)}}.$$

Here, the first factor estimates the fraction of non-expired items in $W_2(t)$ from the top-$k$ items (which can be viewed as a random sample of $W_2$), while the second factor is an estimate of $|W_2(t)|$ itself. Now, suppose that we maintain the sample using BPSWOR. Set $S_2(t) = S_{cand} \cup S_{test}$ and denote by $p'_{(k)}$ the priority of the item with the $k$th highest priority in $S_2$. Consider the estimator

$$\hat{N}_S(t) = \frac{|S(t) \cap \text{top-}k(S_2(t))|}{k} \cdot \frac{k - 1}{1 - p'_{(k)}}.$$

This estimator is similar to $\hat{N}_W(t)$ but solely accesses information available in the sample. Both estimators coincide if and only if top-$k(S_2(t))$ = top-$k(W_2(t))$. This happens if at least $|W(t - \Delta) \cap \text{top-}k(W_2(t))|$ items have been reported as the sample at time $t - \Delta$. Otherwise, the first factor in $\hat{N}_S(t)$ will overestimate the first factor in $\hat{N}_W(t)$, while the second factor will underestimate the respective factor in $\hat{N}_W(t)$. In our experiments, we found that the estimator $\hat{N}_S$ has negligible bias and low variance. Thus, both over- and underestimation seem to balance smoothly, though we do not make any formal claims here.

[Footnote 4: Following an argument as in [3], at most $O(k \log N)$ items of the window are accepted into the candidate set in expectation and each accepted item incurs an expected cost of $O(\log k)$ [13]. At most $k$ items (double-)expire while processing a window, so that the expected cost to process (double-)expirations is $O(k \log k)$.]

4. STRATIFIED SAMPLING

We now consider the problem of maintaining a stratified sample of a time-based sliding window. The general idea is to partition the window into disjoint strata and to maintain a uniform sample of each stratum [9]. Stratified sampling is often superior to uniform sampling because a stratified scheme exploits correlations between time and the quantity of interest. As will become evident later on, stratification also allows us to maintain larger samples than with BPS in the same space.
The main drawback of stratified sampling is its limited applicability; for some problems, it is difficult or even impossible to compute a global solution from the different subsamples. For example, it is not known how the number of distinct values can be estimated from a stratified sample, while the problem has been studied extensively for uniform samples [5]. If, however, the desired analytical tasks can be performed on a stratified sample, stratification is often the method of choice.

We consider stratified sampling schemes, which partition the window into $l > 1$ strata and maintain a uniform sample $S_i$ of each stratum, $1 \le i \le l$. Each sample has a fixed size of $n$ items. In addition to the sample, we also store the stratum size $N_i$ and the timestamp $t_i$ of the upper stratum boundary; these two quantities are required for sample maintenance. The main challenge in stratified sampling is the placement of stratum boundaries because they have a significant impact on the quality of the sample. [Footnote 5] In the simplest version, the stream is divided into strata of equal width (time intervals); we refer to this strategy as equi-width stratification. An alternative strategy is equi-depth stratification, where the window is partitioned into strata of equal size (number of items). Equi-depth stratification outperforms equi-width stratification when the arrival rate of the data stream varies inside a window, but the strata are much more difficult to maintain. In fact, perfect equi-depth stratification is impossible (see below), so that approximate solutions are needed. In this section, we develop a merge-based stratification strategy, which approximates equi-depth stratification to the best possible extent.

Figure 2 illustrates equi-width stratification with parameters $l = 4$ and $n = 1$; sampled items are represented by solid black circles. The figure displays a snapshot of the sample at 3 different points in time, which are arranged vertically and termed a), b) and c). Note that the rightmost stratum ends at the right window boundary and grows as new items arrive, while the leftmost stratum exceeds the window and may contain expired items.
The maintenance of the stratified sample is significantly simpler than the maintenance of a uniform sample because arrivals and expirations are not intermixed within strata. Arriving items are added to the rightmost stratum and, since no expirations can occur there, we can use reservoir sampling to maintain the sample incrementally (see Section 2). Expirations, in contrast, only affect the leftmost stratum. We remove expired items from the respective sample; the remaining sample still represents a uniform sample of the non-expired part of the stratum [8].

Footnote 5: To see this, consider the simple case where all items in the window fall into only one of the $l$ strata. In this case, a fraction of $100(l-1)/l\,\%$ of the available space remains unused.

[Figure 2: Equi-width stratification. Panels a), b) and c) show the window, the stratum boundaries and the sampled items for the item sequences A-G, B-H and C-K; the first stratum is subject to expiration.]

4.1 Effect of Stratum Sizes

The main advantage of equi-width stratification is its simplicity; the main disadvantage is that the sampling fraction may vary widely across the strata. In the example of Figure 2c), the sampling fractions of the first, second and third stratum are given by 50%, 100% and 16%, respectively. In general, dense regions of the stream are underrepresented by an equi-width sample, while sparse regions are overrepresented. Thus, we want to stratify the data stream in such a way that each stratum has approximately the same size and therefore the same sampling fraction; we refer to this approach as equi-depth stratification. Unfortunately, perfect equi-depth stratification is not realizable in practice because the data stream is unknown in advance and we cannot move stratum boundaries arbitrarily.

Before we introduce our approximate merge-based algorithm, we discuss the relationship of stratum sizes and accuracy with the help of a simple example. Suppose that we want to estimate the window average $\mu$ of some attribute of the stream from a stratified sample and assume for simplicity that the respective attribute is normally distributed with mean $\mu$ and variance $\sigma^2$. Further suppose that at some time the window contains $N$ items and is divided into $l$ strata of sizes $N_1, \ldots, N_l$ with $\sum_i N_i = N$. Then, the standard Horvitz-Thompson estimator $\hat\mu$ of $\mu$ is a weighted average of the per-stratum sample averages [12], that is,

$$\hat\mu = \frac{1}{N} \sum_{i=1}^{l} N_i \hat\mu_i,$$

where $\hat\mu_i$ is the sample average of the $i$th stratum.
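The maintenance rules and the estimator $\hat\mu$ described above can be sketched together in a few lines of Python. This is a minimal illustration, not the authors' Java implementation: the class name `Stratum`, the function `stratified_mean` and the representation of items as `(timestamp, value)` pairs are assumptions made for this sketch. The sketch also leaves $N_i$ unchanged when sampled items expire, which is a simplification.

```python
import random


class Stratum:
    """One stratum: uniform sample S_i (at most n items), size N_i, upper boundary t_i."""

    def __init__(self, n, upper, rng):
        self.n = n            # fixed per-stratum sample size
        self.upper = upper    # timestamp of the upper stratum boundary
        self.size = 0         # number of items seen in this stratum (N_i)
        self.sample = []      # list of (timestamp, value) pairs
        self.rng = rng

    def add(self, timestamp, value):
        """Reservoir sampling step for an arrival in the rightmost stratum."""
        self.size += 1
        if len(self.sample) < self.n:
            self.sample.append((timestamp, value))
        else:
            slot = self.rng.randrange(self.size)
            if slot < self.n:
                self.sample[slot] = (timestamp, value)

    def expire(self, window_start):
        """Remove expired sampled items (only needed for the leftmost stratum)."""
        self.sample = [(t, v) for (t, v) in self.sample if t >= window_start]


def stratified_mean(strata):
    """Weighted average of the per-stratum sample means, weighted by stratum size."""
    used = [s for s in strata if s.sample]
    total = sum(s.size for s in used)
    weighted = sum(s.size * sum(v for _, v in s.sample) / len(s.sample) for s in used)
    return weighted / total
```

When every stratum is sampled completely, `stratified_mean` returns the exact average of all items, which is a convenient sanity check for the weighting.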
The estimator has variance

$$\mathrm{Var}[\hat\mu] = \frac{1}{N^2} \sum_{i=1}^{l} N_i^2\, \mathrm{Var}[\hat\mu_i] = \frac{\sigma^2}{n N^2} \sum_{i=1}^{l} N_i^2,$$

where we used $\mathrm{Var}[\hat\mu_i] = \sigma^2/n$. Thus, the variance of the estimator is proportional to the sum of the squares of the stratum sizes or, equivalently, grows with the variance of the stratum sizes:

$$\mathrm{Var}[N_1, \ldots, N_l] = \frac{1}{l} \sum_{i=1}^{l} \left( N_i - \frac{N}{l} \right)^2 = \frac{\sum_{i=1}^{l} N_i^2}{l} - \left( \frac{N}{l} \right)^2 \qquad (3)$$

The variance is minimized if all strata have the same size (best case) and maximized if one stratum contains all the items in the window (worst case).

The above example is extremely simplified because we designed the stream in such a way that the variance $\mathrm{Var}[\hat\mu_i]$ of the estimate is equal in all strata. In general, stratification becomes more efficient as the correlation between the attribute of interest and time increases (because time is the stratification variable). In this paper, however, we assume that no information about the intended use of the sample is available; in this case, our best guess is to assume equal variance in each stratum. Thus, the variance of the stratum sizes as given in (3) can be used to quantify the quality of a given stratification.

4.2 Merge-Based Stratification

Perfect equi-depth stratification is impossible, since we cannot reposition stratum boundaries arbitrarily. To see this, consider the state of the sample as given in Figure 2c). To achieve equi-depth stratification, we would have to (1) remove the stratum boundary between items D and E, and (2) introduce a new stratum boundary between H and I. Here, (1) represents a merge of the first and second stratum. In [4], Brown and Haas have shown that such a merge is possible, that is, a sample of the merged stratum can be computed from the samples of the individual strata. In the example, the merged sample would contain item C with probability 2/3 and item E with probability 1/3. In contrast, (2) represents a split of the third stratum into two new strata, one containing items F-H and one containing items I-K. In the case of a split, it is neither possible to compute the samples of the two new strata nor to determine the stratum sizes. In the example, prior to the split, the third stratum has size 6 and the sample contains item I.
Based on this information, it is impossible to come up with a sample of stratum F-H; we cannot even determine that stratum F-H contains 3 items.

Our merge-based stratified sampling scheme (MBS) approximates equi-depth stratification to the extent possible. The main idea is to merge two adjacent strata from time to time. Such a merge reduces the information stored about the two strata but creates free space at the end of the sample, which can be used for future items. In Figure 3, we illustrate MBS on the example data stream. We start as before with the 4 strata given in a). Right after the arrival of item H, we merge stratum C-D with stratum E to obtain stratum C-E. The decision of when and which strata to merge is the major challenge of the algorithm. After a merge, we use the freed space to start a new, initially empty stratum. The state of the sample after the creation of the new stratum is shown in b). Subsequent arrivals are added to the new stratum (items I, J and K). Finally, stratum A-B expires and, again, a fresh stratum is created; see c). Note that the sample is much more balanced than with equi-width stratification (Figure 2).

[Figure 3: Merge-based stratification. Panels a), b) and c) show the item sequences A-G, B-H and C-K; new strata are created through merging and through expiration.]

Before we discuss when to merge, we briefly describe how to merge. Suppose that we want to merge two adjacent strata $R_1$ and $R_2$ with $|R_1|, |R_2| \ge n$. Denote by $S_i$, $N_i$ and $t_i$ the uniform sample (of size $n$), the stratum size and the upper boundary of stratum $R_i$, $i \in \{1, 2\}$. Then, the merged stratum $R = R_1 \cup R_2$ has size $N_1 + N_2$ and upper boundary $t_2$. In [4], Brown and Haas have shown how to merge $S_1$ and $S_2$ to obtain a uniform sample $S$ of $R_1 \cup R_2$ with $|S| = n$. Let $X$ be a random variable for the number of items from $R_1$ in a size-$n$ uniform sample drawn directly from $R$. $X$ is hypergeometrically distributed with

$$P\{X = x\} = \binom{N_1}{x} \binom{N_2}{n-x} \Big/ \binom{N_1 + N_2}{n}$$

for $0 \le x \le n$. Since all the distribution parameters are known, we can obtain a realization $x$ of $X$ by throwing a die. Then, we compute uniform subsamples $S_1'$ and $S_2'$ from $S_1$ and $S_2$, respectively, with $|S_1'| = x$ and $|S_2'| = n - x$. The subsamples can be computed using reservoir sampling, though more efficient sampling schemes exist for this purpose [14]. The final sample $S$ is then set to the union of $S_1'$ and $S_2'$; see [4] for a proof of the uniformity of $S$.

4.3 When To Merge Which Strata

The decision of when and which strata to merge is crucial for merge-based stratification. Suppose that at some time, the window is divided into $l$ strata $R_1, \ldots, R_l$ of size $N_1, \ldots, N_l$, respectively. During the subsequent sampling process, a new stratum is created when either (1) stratum $R_1$ expires or (2) two adjacent strata are merged. Observe that we have no influence on (1), but we can apply (2) as needed. We now treat the problem of when and which strata to merge as an optimization problem, where the optimization goal is to minimize the variance of the stratum sizes at the time of the expiration of $R_1$. Thus, whenever the first stratum expires, the sample looks as much like an equi-depth sample as possible.

Denote by $R^+ = \{e_1, \ldots, e_{N^+}\}$ the set of items that arrive until the expiration of stratum $R_1$ (but have not yet arrived) and set $N^+ = |R^+|$ [fn 6]. At the time of $R_1$'s expiration and before the creation of the new stratum, the window is divided into $l - 1$ strata so that there are $l - 2$ inner stratum boundaries. The positions of the stratum boundaries depend on both the number and the point in time of any merges we perform. Our algorithm rests on the observation that for any way of putting $l - 2$ stratum boundaries in the sequence $R_2, R_3, \ldots, R_l, e_1, e_2, \ldots, e_{N^+}$, there is at least one corresponding sequence of merges that results in the respective stratification.
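The merge step of Section 4.2 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `merge_samples` is hypothetical, the hypergeometric value $x$ is obtained by simulating $n$ draws without replacement from an urn with $N_1$ and $N_2$ balls, and the subsamples are drawn with `random.Random.sample` instead of reservoir sampling.

```python
import random


def merge_samples(s1, n1, s2, n2, rng):
    """Merge uniform size-n samples s1, s2 of strata with sizes n1, n2 into one size-n sample."""
    n = len(s1)
    if len(s2) != n or n1 < n or n2 < n:
        raise ValueError("both samples must have size n and both strata at least n items")
    # Draw x ~ Hypergeometric(N1, N2, n): count how many of n draws without
    # replacement from the merged stratum fall into the first stratum.
    x = 0
    left1, left2 = n1, n2
    for _ in range(n):
        if rng.randrange(left1 + left2) < left1:
            x += 1
            left1 -= 1
        else:
            left2 -= 1
    # Uniform subsamples of sizes x and n - x; their union is the merged sample.
    return rng.sample(s1, x) + rng.sample(s2, n - x)
```

For the example of Figure 2c) with $n = 1$, merging stratum C-D (size 2, sample C) with stratum E (size 1, sample E) via `merge_samples(["C"], 2, ["E"], 1, rng)` should return C in about two thirds of the calls.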
For example, the stratification

$$R_2 \mid R_3 \mid \cdots \mid R_l, e_1, \ldots, e_{N^+}$$

is achieved if no merge is performed (vertical bars denote boundaries), while

$$R_2 \mid \cdots \mid R_i, R_{i+1} \mid \cdots \mid R_l, e_1, \ldots, e_j \mid e_{j+1}, \ldots, e_{N^+}$$

is achieved if stratum $R_i$ and $R_{i+1}$ are merged after the arrival of item $e_j$ and before the arrival of item $e_{j+1}$. In general, for every stratum boundary in between $R_l, e_1, \ldots, e_{N^+}$, we drop a stratum boundary in between $R_2, \ldots, R_l$ by performing a merge operation at the respective point in time.

Footnote 6: In practice, $N^+$ is not known in advance; we address this issue in Section 4.4.

We can now reformulate the optimization problem: Find the partitioning of the integers

$$N_2, \ldots, N_l, \underbrace{1, \ldots, 1}_{N^+ \text{ times}}$$

into $l - 1$ consecutive and non-empty partitions so that the variance (or sum of squares) of the intra-partition sums is minimized. The problem can be solved using dynamic programming in $O(l(l + N^+)^2)$ time [10]. In our specific instance of the problem, however, the last $N^+$ values of the sequence of integers are all equal to 1. As shown below, we can leverage this fact to construct a dynamic programming algorithm that obtains an optimum solution in only $O(l^3)$ time. Since $N^+$ is typically large, the improvement in performance can be significant.

The algorithm is as follows. Let $\mathrm{opt}(k, i)$ be the minimum sum of squares when $k$ of the $l - 2$ boundaries are placed between $N_2, \ldots, N_l$ and the last one of these $k$ boundaries is placed right after $N_i$; $0 \le k \le l - 2$ and $k < i < l$. Then, $\mathrm{opt}(k, i)$ can be decomposed into two functions

$$\mathrm{opt}(k, i) = f(k, i) + g(k, i),$$

where $f(k, i)$ is the minimum sum of squares for the $k$ partitions left of and including $N_i$ and $g(k, i)$ is the minimum sum of squares for the $l - k - 1$ partitions right of $N_i$. The decomposition significantly reduces the complexity because the computation of $g$ does not involve any optimization. To define $g(k, i)$, observe that by definition, there are no boundaries in between $N_{i+1}, \ldots, N_l$, so that these values fall into a single partition and we can sum them up.
The resulting part of the integer sequence is then

$$N_{i+1,l}, 1, \ldots, 1,$$

where $N_{a,b} = \sum_{j=a}^{b} N_j$ [fn 7]. In fact, $g$ is minimized if all the $l - k - 1$ partitions have the same size, that is, size

$$\frac{N_{i+1,l} + N^+}{l - k - 1}.$$

If $N_{i+1,l}$ is larger than this average size, the minimum value of $g$ cannot be obtained. In this case, the best choice is to put $N_{i+1,l}$ in a stratum of its own; the remaining $l - k - 2$ partitions then all have size $N^+/(l - k - 2)$. Thus, the function $g$ is given by

$$g(k, i) = \begin{cases} (l - k - 1) \left( \dfrac{N_{i+1,l} + N^+}{l - k - 1} \right)^2 & \text{if } N_{i+1,l} < \dfrac{N_{i+1,l} + N^+}{l - k - 1}, \\[2ex] N_{i+1,l}^2 + (l - k - 2) \left( \dfrac{N^+}{l - k - 2} \right)^2 & \text{otherwise.} \end{cases}$$

The function $f$ can be defined recursively with

$$f(0, i) = N_{2,i}^2, \qquad f(k, i) = \min_{k \le j < i} \left\{ f(k - 1, j) + N_{j+1,i}^2 \right\},$$

and the optimum solution is given by

$$\min_{0 \le k \le l-2}\; \min_{k < i < l}\; \mathrm{opt}(k, i).$$

Footnote 7: $N_{a,b}$ can be computed in constant time with the help of an array containing the prefix sums $N_{2,2}, \ldots, N_{2,l}$ [10].
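For reference, the optimization problem of this section (split a sequence of positive integers into a fixed number of consecutive, non-empty partitions so that the sum of squared partition sums is minimal) can also be solved with the generic dynamic program mentioned above. The sketch below implements only that straightforward version, not the specialized $O(l^3)$ algorithm built from $f$ and $g$; the function name `min_sum_of_squares_partition` and the list-based interface are assumptions of this sketch.

```python
def min_sum_of_squares_partition(values, parts):
    """Return (cost, boundaries) for the best split of `values` into `parts` groups.

    A boundary b means that a group ends after the first b values.
    """
    n = len(values)
    if not 1 <= parts <= n:
        raise ValueError("need 1 <= parts <= len(values)")
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    inf = float("inf")
    # best[p][i]: minimum cost of splitting the first i values into p groups
    best = [[inf] * (n + 1) for _ in range(parts + 1)]
    cut = [[0] * (n + 1) for _ in range(parts + 1)]
    best[0][0] = 0
    for p in range(1, parts + 1):
        for i in range(p, n + 1):
            for j in range(p - 1, i):
                cost = best[p - 1][j] + (prefix[i] - prefix[j]) ** 2
                if cost < best[p][i]:
                    best[p][i] = cost
                    cut[p][i] = j
    # Walk back through the recorded cut positions to recover the boundaries.
    boundaries = []
    i = n
    for p in range(parts, 0, -1):
        i = cut[p][i]
        boundaries.append(i)
    boundaries.reverse()
    return best[parts][n], boundaries[1:]
```

With stratum sizes 4 and 2 followed by four future arrivals, `min_sum_of_squares_partition([4, 2, 1, 1, 1, 1], 3)` yields the partitions 4 | 2, 1 | 1, 1, 1, that is, the second stratum absorbs the first arriving item before a new stratum is started. The running time of this generic version grows quadratically with the sequence length, which is what the specialized algorithm avoids when $N^+$ is large.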

To compute the optimum solution, we iterate over $k$ in increasing order and memoize the values of $f(k, \cdot)$; these values will be reused for the computation of $f(k + 1, \cdot)$. The global solution and the corresponding stratum boundaries are tracked during the process. Since each of the loop variables $k$, $i$ and $j$ takes at most $l$ different values, the total time complexity is $O(l^3)$. The algorithm requires $O(l)$ space.

4.4 Estimation of Arriving-Item Count

The decision of when to merge depends on the number $N^+$ of items that arrive until the expiration of the first stratum. In practice, $N^+$ is unknown and has to be estimated. In this section, we propose a simple and fast-to-compute estimator for $N^+$. Especially for bursty data streams, estimation errors can occur; we therefore discuss how to make MBS robust against estimation errors.

As before, suppose that at some time $t$ the sample consists of $l$ strata of sizes $N_1, \ldots, N_l$ and denote by $t_i$ the upper boundary of the $i$th stratum, $1 \le i \le l$. Furthermore, denote by $\delta = t_1 + \Delta - t$ the time span until the expiration of the first stratum. We want to predict the number of items that arrive until time $t + \delta$. Denote by $j$ the stratum index such that $t - t_j > \delta$ and $t - t_{j+1} \le \delta$. An estimate $\hat N^+$ of $N^+$ is then given by

$$\hat N^+ = \delta \cdot \frac{\sum_{i=j+1}^{l} N_i}{t - t_j}.$$

The estimate roughly equals the number of items that arrived in the last $\delta$ time units. The intuition behind this estimator is that the amount of history we use for estimation depends on how far we want to extrapolate into the future. In conjunction with the robustness techniques discussed below, this approach showed good performance in our experiments.

Whenever a stratum expires, we compute the estimate $\hat N^+$ and, based on this estimate, determine the optimum sequence of merges using the algorithm given in Section 4.3. Denote by $\hat m \ge 0$ the total number of merges in the resulting sequence and by $\hat N^+_1$ the number of items that arrive before the first merge. In general, we now wait for $\hat N^+_1$ items to arrive in the stream and then perform a merge operation.
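The estimator for $N^+$ can be sketched as follows. The sketch assumes the reading $\hat N^+ = \delta \cdot \sum_{i=j+1}^{l} N_i / (t - t_j)$ with $\delta = t_1 + \Delta - t$, where $j$ is the last stratum whose upper boundary lies more than $\delta$ time units in the past. The function name `estimate_arrivals`, the list-based interface and the error raised when too little history is available are assumptions of this sketch.

```python
def estimate_arrivals(now, window_length, sizes, uppers):
    """Estimate the number of items arriving until the first stratum expires.

    sizes[i] and uppers[i] hold N_i and t_i of stratum i+1 (0-based lists, oldest first).
    """
    delta = uppers[0] + window_length - now  # time until stratum 1 expires
    if delta <= 0:
        raise ValueError("first stratum has already expired")
    # Largest index j whose upper boundary is more than delta in the past.
    j = None
    for idx, upper in enumerate(uppers):
        if now - upper > delta:
            j = idx
    if j is None:
        raise ValueError("not enough history to extrapolate delta time units")
    history = now - uppers[j]                # time covered by strata j+1, ..., l
    return delta * sum(sizes[j + 1:]) / history
```

For example, with a window length of 60, current time 100, upper boundaries 50, 70, 90, 100 and stratum sizes 10, 20, 30, 40, the first stratum expires in 10 time units; the last two strata cover 30 time units and 70 items, so the estimate is 70/30 items per time unit times 10.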
Note that the value of $\hat m$ ($\hat N^+_1$) is a monotonically increasing (decreasing) function of $\hat N^+$; the more items arrive before the expiration of the first stratum, the more merges we perform. Thus, underestimation may lead to too few merges and overestimation may lead to too many merges. To make MBS robust against estimation errors, we recompute the sequence of merges whenever we observe that the data stream behaves differently than predicted. There are two cases:

$\hat m = 0$: We recompute $\hat m$ and $\hat N^+_1$ only if more than $\hat N^+$ items arrive in the stream, so that a merge may become profitable. This strategy is optimal if $\hat N^+ \ge N^+$ but might otherwise lead to a tardy merge.

$\hat m > 0$: Denote by $\hat\delta = (\hat N^+_1 / \hat N^+)\,\delta$ the estimated time span until the arrival of the $\hat N^+_1$-th item. We recompute the estimates if the $\hat N^+_1$-th item does not arrive close to time $t + \hat\delta$. For concreteness, recomputation is triggered if either the $\hat N^+_1$-th item arrives before time $t + (1 - \epsilon)\hat\delta$ or fewer than $\hat N^+_1$ items have arrived at time $t + (1 + \epsilon)\hat\delta$, where $0 < \epsilon < 1$ determines the validity interval of the estimate and is usually set to a small value, say 5%.

In our experiments, the variance of the stratum sizes achieved by MBS without a-priori knowledge of $N^+$ was almost as low as the one achieved by MBS with a-priori knowledge of $N^+$.

5. EXPERIMENTS

We implemented bounded-space priority sampling with and without replacement (BPSWR/BPSWOR), priority sampling without replacement (PSWOR), Bernoulli sampling and the stratified sampling schemes in Java 1.6. The experiments were run on a workstation PC with a 3 GHz Intel Pentium 4 processor and 2.5 GB main memory.

Almost all of the experiments were run on real-world datasets because we felt that synthetic datasets cannot capture the complex distribution of real-world arrival rates. We used two real datasets, which reflect two different types of data streams frequently found in practice. The NETWORK dataset, which contains network traffic data, has a very bursty arrival rate with high short-term peaks.
In contrast, the SEARCH dataset contains usage statistics of a search engine and the arrival rate changes slowly; it basically depends on the time of day. These two datasets allowed us to study the influence of the evolution of the arrival rates on the sampling process. The NETWORK dataset was collected by monitoring one of our web servers for a period of 1 month. The dataset contains 8,430,904 items, where each item represents a TCP packet and consists of a timestamp (8 bytes), a source IP and port (4 + 2 bytes), a destination IP and port (4 + 2 bytes) and the size of the user data (2 bytes). The SEARCH dataset was collected in a period of 3 months and contains 36,389,565 items. Each item consists of a timestamp (8 bytes) and a user id (4 bytes).

For most of our experiments, we do not report the estimation error of a specific estimate derived from the sample but rather give the key characteristics that influence the estimation error of any potential estimate. This way, our results are independent of the actual values associated with the items in our datasets. In the case of uniform sampling, the key characteristic is the sample size. Two uniform samples of the same size are identical in distribution, no matter which scheme has been used to compute them. Larger samples inevitably lead to a smaller estimation error. For stratified sampling, the key characteristic is the variance of the stratum sizes. This variance is a direct measure of how close the stratification is to equi-depth stratification. A smaller variance typically results in less estimation error.

5.1 Summary of Experimental Results

For uniform sampling, we found that: BPSWOR is the method of choice when the available memory is limited and the data stream rate is varying. It then produces larger samples than Bernoulli sampling or PSWOR. Also, BPSWOR is the only scheme that does not require a-priori information about the data stream and guarantees an upper bound on the memory consumption. The window size ratio of the current window to both the current and previous window has a significant impact on the sample size of BPSWOR.
A small ratio leads to smaller samples, while a large ratio results in larger samples. For a given ratio, the sample size has low variance and is skewed towards larger samples.

BPSWOR is superior to BPSWR because it is significantly faster and samples without replacement. The window size estimate discussed in Section 3.5 has low relative error. The relative error decreases with an increasing sample size.

For stratified sampling, we found that: Merge-based stratification leads to significantly lower stratum size variances than equi-width stratification when the data stream is bursty. Both schemes have comparable performance when the data stream rate changes slowly. Merge-based stratification seems to be robust to errors in the arrival rate estimate. Results with estimated arrival rates are close to the theoretical optimum. When the number of strata is not too large ($\le 32$), the overhead of merge-based stratification is low.

5.2 Uniform Sampling, Synthetic Data

In a first experiment, we compared Bernoulli sampling, PSWOR and BPSWOR. Neither Bernoulli sampling nor PSWOR can guarantee an upper bound on the space consumption, and without a-priori knowledge of the stream it is not possible to parametrize them so that they only infrequently exceed the space bound. The goal of this experiment is to compare the sample size and space consumption of the three schemes under the assumption that such a parametrization is possible. For this purpose, we generated a synthetic data stream, where each item of the data stream consists of an 8-byte timestamp and 32 bytes of dummy data. To generate the timestamps, we modeled the arrival rate of the stream using a sine curve with a 24h period, which takes values between 3,000 and 5,000 items per hour. We superimposed the probability density function (PDF) of a normal distribution with mean 24 and variance 0.5 on the sine curve; the PDF has been scaled so that it takes a maximum of 30,000 items per hour. This models real-world scenarios where the peak arrival rate (scaled PDF) is much higher than the average arrival rate (sine curve).

We used the three sampling schemes to maintain a sample from a sliding window of 1 hour length; the window size over time is given in Figure 4a. We used a space budget of 32 kbytes; at most 819 items can be stored in 32 kbytes of space.
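The synthetic setup can be reproduced approximately with the sketch below. It is an illustration under stated assumptions: the phase of the sine curve is not given in the text, so the sketch starts the curve at its midpoint at hour 0, and the names `arrival_rate` and `max_items` are hypothetical. The item capacity follows directly from the item size of 8 + 32 bytes and a budget of 32 · 1024 bytes.

```python
import math

ITEM_BYTES = 8 + 32          # timestamp plus dummy data
BUDGET_BYTES = 32 * 1024     # space budget of 32 kbytes


def max_items(budget_bytes=BUDGET_BYTES, item_bytes=ITEM_BYTES):
    """Number of whole items that fit into the space budget."""
    return budget_bytes // item_bytes


def arrival_rate(hour):
    """Modeled arrival rate in items per hour at time `hour` (phase is an assumption)."""
    # Sine curve with a 24h period oscillating between 3,000 and 5,000 items per hour.
    base = 4000.0 + 1000.0 * math.sin(2.0 * math.pi * hour / 24.0)
    # Normal PDF with mean 24 and variance 0.5, rescaled to a maximum of 30,000.
    peak = 30000.0 * math.exp(-((hour - 24.0) ** 2) / (2.0 * 0.5))
    return base + peak
```

Under these assumptions, `max_items()` gives 819, and the modeled rate at hour 24 is several times higher than the rate away from the peak, which is the situation the experiment is designed to stress.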
For the sampling schemes, we used parameters $k_{\text{BPSWOR}} = 585$ (number of candidate/test items), $k_{\text{PSWOR}} = 113$ (sample size) and $q_{\text{Bernoulli}}$ = (sampling rate). The latter two parameters have been chosen so that the expected space consumption at the peak arrival rate equals 32 kbytes; as discussed above, this parametrization is only possible because we know the behavior of the stream in advance. During the sampling process, we monitored both sample size and space consumption; the results are given in Figure 4b and 4c, respectively.

Bernoulli sampling. The size of the Bernoulli sample follows the size of the window: It fluctuates around 110 items in the average case but stays close to the 819 items at peak times. The space consumption of the sample is proportional to the sample size; a large fraction of the available space remains unused in the average case.

Priority sampling. PSWOR produces a constant sample size of 113 items. The space consumption depends logarithmically on the size of the window because, in addition to the sample items, PSWOR also stores the replacement set and the priority of each item.

Bounded priority sampling. BPSWOR produces a sample size of 300 items in the average case and therefore has a much better space utilization than Bernoulli sampling and PSWOR. When the peak arrives, the sample size first grows above, then falls below the 300-item average. Afterwards it stabilizes again. By Theorem 3, the sample size depends on the ratio of the number of items in the current window to the number of items in both the current and previous window together. This fraction is roughly constant in the average case but varies with the arrival of the peak load. Interestingly, the scheme almost always uses the entire available memory to store the candidate items and the test items. The space consumption slightly decreases when the peak arrives. In this case, we store fewer than $k$ test items because, due to the increased arrival rate, candidate items are replaced by new items before their expiration and so do not become test items.
To summarize, each of the three schemes has a distinctive advantage: Bernoulli sampling does not have any memory overhead, PSWOR guarantees a fixed sample size and BPSWOR samples in bounded space. If the available memory is limited, BPSWOR is the method of choice because it produces larger sample sizes than Bernoulli sampling or PSWOR and does not require any a-priori knowledge about the data stream. For these reasons, we do not consider Bernoulli sampling and PSWOR for our real-world experiments.

5.3 Uniform Sampling, Real Data

Next, we ran BPSWR and BPSWOR on our real-world datasets with a window size of one hour. We monitored the sample size, elapsed time and the window-size estimate during the sampling process and recorded the respective values at every full hour. We did not record more frequently so as to minimize the correlation between the measurements. The experiment was repeated with space budgets ranging from 1 kbyte to 32 kbytes. For each space budget, the experiment was repeated 32 times.

Sample size. In Figure 4d, we report the distribution of the BPSWOR sample size for the NETWORK dataset; similar results were observed with BPSWR. We used a space budget of 32 kbytes, which corresponds to a value of $k = 862$. The figure shows a histogram of the relative frequencies for varying sample sizes. As can be seen, the sample size concentrates around the average of 448 items and varies in the range from 11 to 862 items. The standard deviation of the sample size is 173 and in 95% of the cases, the sample size was larger than 176 items.

By Theorem 3, the sample size depends on the ratio of the size of the current window to the size of both the prior and the current window, or the window size ratio for short. In Figure 4e, we give a histogram of the window size ratios in the NETWORK dataset. As can be seen, the distribution of the window size ratio has a striking similarity to the distribution of the sample size. To further investigate this issue, we give a box-and-whisker plot of the sample size for varying ranges of window size ratios in Figure 4f.
In a box-and-whisker plot, a box ranging from the first quartile to the third quartile of the distribution is drawn around the median value. From the box, whiskers extend to the minimum and maximum values as long as these values lie within 1.5 times the interquartile distance (the height of the box); the remaining values are treated as outliers and are directly added to the plot. From the figure, it becomes evident that the window size ratio has a significant influence on the sample size. Also, for each window size ratio, the sample size has low variance and is skewed towards larger samples. The skew results from the fact that the worst-case assumption of Theorem 3 does not always hold in practice; if it does not hold, the sample size is larger.

In Figures 4g, 4h and 4i, we give the corresponding results for the SEARCH dataset. Since the items in the SEARCH dataset require less space than the NETWORK items, a larger value of $k = 1170$ was chosen. As can be seen in the figure, the sample size distribution is much tighter because the arrival rate in the dataset does not vary as rapidly. The sample size ranges from 0 items to 1170 items, where a value of 0 has only been observed when the window was actually empty. The sample size averages 579 items and is larger than 447 items in 95% of the cases.

Performance. In Figure 4j, we compare the performance of BPSWR and BPSWOR for various space budgets on the NETWORK dataset. The figure shows the average time in milliseconds required to process a single item. It has logarithmic axes. For both algorithms, the per-item processing time increases with an increasing space budget, but BPSWOR is significantly more efficient than BPSWR. The results verify the theoretical analysis in Section 3.4. Since BPSWOR additionally samples without replacement, it is clearly superior to BPSWR.

Estimation of window size. In a final experiment with uniform sampling, we evaluated the accuracy and precision of the window size estimator given in Section 3.5 in terms of its relative error; the relative error of an estimate $\hat N$ of $N$ is defined as $|\hat N - N| / N$. Figure 4k and 4l display the distribution of the relative error for the NETWORK and SEARCH dataset, respectively, in a kernel-density plot.
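The relative error of the window size estimator can be explored with a small simulation. The sketch below implements the estimator of Section 3.5 in its window-based form, that is, the fraction of non-expired items among the top-$k$ priorities multiplied by $(k-1)/(1 - p_{(k)})$. It assumes independent priorities drawn uniformly from (0, 1) and has access to all priorities of the current and previous window, which a bounded-space sample would not have; the function names are hypothetical.

```python
import heapq
import random


def estimate_window_size(current_priorities, previous_priorities, k):
    """Estimate the number of current-window items from the k highest priorities."""
    tagged = [(p, True) for p in current_priorities]
    tagged += [(p, False) for p in previous_priorities]
    if k < 2 or len(tagged) < k:
        raise ValueError("need k >= 2 and at least k items")
    top = heapq.nlargest(k, tagged)        # k highest priorities, descending
    p_k = top[-1][0]                       # k-th highest priority
    non_expired = sum(1 for _, is_current in top if is_current)
    return (non_expired / k) * (k - 1) / (1.0 - p_k)


def relative_error(estimate, truth):
    """Relative error |estimate - truth| / truth."""
    return abs(estimate - truth) / truth
```

A possible usage is to draw 10,000 priorities for the current window and 5,000 for the previous one with `random.Random(7).random()`, estimate with `k = 1000` and compare against the true size of 10,000 via `relative_error`.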
The relative error is given for memory budgets of 32 kbytes, 64 kbytes and 128 kbytes for the entire sample; only the priorities are actually used for window size estimation. For both datasets and all sample sizes, the relative error almost always lies below 10% and often is much lower. As the memory budget and thus the value of $k$ increases, the estimation error decreases; see [3] for a detailed discussion of this behavior. We conclude that our window size estimator produces low-error estimates and can be used when synopses specialized on window size estimation are unavailable.

5.4 Stratified Sampling

In the next set of experiments, we compared equi-width stratification with merge-based stratification (MBS). Recall that during the sampling process, MBS occasionally requires an estimate of the number of items that arrive until the expiration of the first stratum. To quantify the impact of estimation, we considered two versions of MBS in our experiments. MBS-$N$ makes use of an "oracle": Whenever an estimate of the number of arriving items is required, we determine the exact number directly from the dataset so that no estimation error occurs. MBS-$N$ can therefore be seen as the theoretical optimum of merge-based stratification. In contrast, MBS-$\hat N$ uses the estimation technique and robustness modifications as described in Section 4.4. The experimental setup is identical to the one used for uniform sampling, that is, we sample from the real-world datasets over a sliding window of 1 hour length. Unless stated otherwise, we used a space budget of 32 kbytes and $l = 32$ strata.

Variance of stratum sizes. We first compared the variance of the stratum sizes. In order to facilitate a meaningful variance comparison for windows of varying size, we report the coefficient of variation (CV) instead of the stratum size variance directly. The CV is defined as the standard deviation (square root of variance) normalized by the mean stratum size; a value less than 1 indicates a low-variance distribution, whereas a value larger than 1 is often considered high variance.
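The coefficient of variation of a set of stratum sizes can be computed as in the following sketch. It uses the population variance of the stratum sizes, which matches Equation (3); the function name `coefficient_of_variation` is an assumption of this sketch.

```python
import math


def coefficient_of_variation(sizes):
    """Standard deviation of the stratum sizes divided by their mean."""
    mean = sum(sizes) / len(sizes)
    variance = sum((s - mean) ** 2 for s in sizes) / len(sizes)
    return math.sqrt(variance) / mean
```

For the three strata of Figure 2c), which hold 2, 1 and 6 items, the mean stratum size is 3 and the variance is 14/3, so the CV is about 0.72; perfectly balanced strata give a CV of 0.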
Figure 4m displays the distribution of the CV for the NETWORK dataset using a kernel-density plot. As can be seen, equi-width stratification leads to high values of the CV, while merge-based stratification produces significantly better results. Also, MBS-$N$ and MBS-$\hat N$ perform similarly, with MBS-$N$ being slightly superior. The difference between equi-width stratification and the MBS schemes is attributed to the burstiness of the NETWORK stream, in which the arrival rates vary significantly during a window length. In contrast, Figure 4n shows the distribution of the CV for the SEARCH dataset. Since the arrival rates change only slowly, equi-width stratification already produces very good results and the merge-based schemes essentially never decide to merge two adjacent strata. The three schemes produce almost identical results. Therefore, the burstier the data stream, the more beneficial merge-based stratification becomes.

Accuracy of estimate (example). In a next experiment, we used the stratified sampling schemes to estimate the throughput of the NETWORK data from the sample. Here, we defined the throughput as the sum of the user-data size attribute over the entire window (see the CQL query given in the introduction). Figure 4o gives the distribution of the relative error of the estimate. The estimates derived from the merge-based schemes have a significantly lower estimation error than the estimates achieved with equi-width stratification. Thus, intelligent stratification indeed improves the quality of the sample. Note that for the SEARCH dataset, the distribution of the relative error would be almost indistinguishable for the three schemes because for this dataset, merge-based stratification does not improve upon equi-width stratification.

Number of strata (example). The number $l$ of strata can have a significant influence on the quality of the estimates. In Table 1, we give the average of the relative error (ARE) of the NETWORK throughput estimate for a varying number of strata. With an increasing number of strata, the ARE increases for equi-width stratification but decreases for the merge-based schemes.
On the one hand, the sample size per stratum decreases as $l$ increases and it becomes more and more important to distribute the strata evenly across the window. In fact, when the number of strata was high, equi-width stratification frequently produced empty strata and thereby wasted some of the available space. On the other hand, a large number of strata better exploits the correlations between time and the attribute of interest. Thus, the estimation error often decreases with an increasing value of $l$. In our experiment, the correlation of the user-data size attribute and time is low, so that the decrease in estimation error is also relatively low.

Performance. In a final experiment, we measured the average per-item processing time for the three schemes and a varying number of strata. The results for the NETWORK data are given in Table 1. Clearly, equi-width stratification is the most efficient technique and the processing time does not depend upon the number of strata. The MBS schemes are slower because they occasionally have to 1) estimate the number of arriving items, 2) determine the optimum stratification and 3) merge adjacent strata. The computational effort increases as the number of strata increases. MBS-$N$ is slightly faster than MBS-$\hat N$ because MBS-$\hat N$ reevaluates 2) if the stream behaves differently than predicted. In comparison to equi-width stratification, MBS leads to a significant performance overhead if the number of strata is large. However, when the number of strata is not too large ($l \le 32$), the overhead is low but the quality of the resulting stratification might increase significantly.

Table 1: Influence of the number of strata (NETWORK)

ARE
  Equi-width     2.31%  2.73%  3.44%  4.42%  5.90%
  MBS-$N$        2.00%  1.83%  1.74%  1.70%  1.72%
  MBS-$\hat N$   2.04%  1.88%  1.82%  1.76%  1.79%
Time (µs)
  Equi-width
  MBS-$N$
  MBS-$\hat N$

6. CONCLUSION

We have studied bounded-space techniques for maintaining uniform and stratified samples over a time-based sliding window of a data stream. For uniform sampling, we have shown that any bounded-space sampling scheme that guarantees a lower bound on the sample size requires expected space logarithmic in the number of items in the window; the worst-case space consumption is at least as large. Our provably correct BPS scheme is the first bounded-space sampling scheme for time-based sliding windows. We have shown how BPS can be extended to efficiently sample without replacement and developed a low-variance estimator for the number of items in the window. The sample size produced by BPS is stable in general, but quick changes of the arrival rate might lead to temporarily smaller or larger samples. For stratified sampling, we have shown how the sample can be distributed evenly across the window by merging adjacent strata from time to time.
The decision of when and which strata to merge is based on a dynamic programming algorithm, which uses an estimate of the arrival rate to determine the best achievable stratum boundaries. MBS is robust against estimation errors and produces significantly more balanced samples than equi-width stratification. We found that the overhead of MBS is small as long as the number of strata is not too large. Especially for bursty data streams, the increased precision of the estimates derived from the sample compensates for the overhead in computational cost.

7. REFERENCES

[1] Charu C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In Proc. VLDB.
[2] Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In Proc. SODA.
[3] Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses for distinct-value estimation under multiset operations. In Proc. SIGMOD.
[4] Paul G. Brown and Peter J. Haas. Techniques for warehousing of sample data. In Proc. ICDE.
[5] Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Towards estimation error guarantees for distinct values. In Proc. PODS.
[6] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6).
[7] Rainer Gemulla, Wolfgang Lehner, and Peter J. Haas. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal, 17(2).
[8] Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. In Proc. VLDB.
[9] Peter J. Haas. Data stream sampling: Basic techniques and results. In Data Stream Management: Processing High Speed Data Streams. Springer.
[10] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, and Torsten Suel. Optimal histograms with quality guarantees. In Proc. VLDB.
[11] Suman Nath, Phillip B. Gibbons, Srinivasan Seshan, and Zachary R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In Proc. SenSys.
[12] Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. Model Assisted Survey Sampling. Springer Series in Statistics. Springer.
[13] Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5).
[14] Jeffrey Scott Vitter. Faster methods for random sampling. Commun. ACM, 27(7).
[15] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM TOMS, 11(1):37-57.

Repeatability Assessment Result

All the results in this paper were verified by the SIGMOD repeatability committee. Code and/or data used in the paper are available at http:// sigmod2008/.

[Figure 4: Experimental results (see subheadings on the left-hand side).
Uniform, synthetic: (a) Window size over time; (b) Sample size over time for BPSWOR, PSWOR and Bernoulli sampling, with upper bound; (c) Space consumption (kbytes) over time for BPSWOR, PSWOR and Bernoulli sampling, with upper bound.
Uniform, NETWORK: (d) Sample size (relative frequency, with upper bound); (e) Window size ratio (relative frequency); (f) Both (sample size per range of window size ratios).
Uniform, SEARCH: (g) Sample size (relative frequency, with upper bound); (h) Window size ratio (relative frequency); (i) Both (sample size per range of window size ratios).
Uniform, real data: (j) Time (NETWORK): milliseconds per item versus space (kbytes) for BPSWR and BPSWOR; (k) Size est. (NETWORK): density of the relative error of the window size estimate for 32, 64 and 128 kbytes; (l) Size est. (SEARCH): density of the relative error of the window size estimate for 32, 64 and 128 kbytes.
Stratified, real data: (m) Stratum size variance (NETWORK): density of the coefficient of variation for MBS-N, MBS-N̂ and equi-width; (n) Stratum size variance (SEARCH): density of the coefficient of variation for MBS-N, MBS-N̂ and equi-width; (o) Throughput estimation (NETWORK): density of the relative error of the throughput estimate for MBS-N, MBS-N̂ and equi-width.]