Sampling Time-Based Sliding Windows in Bounded Space


Rainer Gemulla
Technische Universität Dresden, Dresden, Germany

Wolfgang Lehner
Technische Universität Dresden, Dresden, Germany

ABSTRACT

Random sampling is an appealing approach to build synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so that the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce each sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS quickly adapts to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.
Categories and Subject Descriptors
H.2 [Database Management]: Miscellaneous; G.3 [Probability and Statistics]: Probabilistic algorithms

General Terms
Algorithms, Theory

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD '08, June 9-12, 2008, Vancouver, BC, Canada. Copyright 2008 ACM /08/06...$5.00.

1. INTRODUCTION

Random sampling techniques are at the heart of every data stream management system. In fact, it is often infeasible to process and/or store the entire data stream in high-speed applications like monitoring of sensor data, network traffic, or transaction logs. Random sampling is an appealing approach to build synopses of large data streams, since most analytical tasks can be executed on a sample either directly or in a slightly modified fashion. For example, random samples can be used to estimate sums, averages and quantiles, but they also support complex data mining tasks such as clustering.

The main challenge in data stream sampling is to maintain samples that represent only the part of the data stream relevant for analysis. Since many analytical tasks focus on only a recent part of the data stream, it is often unnecessary or infeasible to maintain a sample of the data stream in its entirety [1]. Current research addresses this problem by maintaining so-called sequence-based (SB) samples. In an SB sampling scheme, the probability of an item being included in the sample depends only on its position in the data stream. Popular variants are the maintenance of a uniform sample of the k most recent items [2, 9] or biased sampling schemes in which the inclusion probability of an item decays as new items arrive [9, 1]. In general, SB schemes provide for efficient sample maintenance in bounded space.
Their disadvantage, however, is that they are not well-suited for time-based analysis. To see this, consider the following simple CQL query:

    SELECT SUM(size) AS num_bytes FROM packets [Range 60 Minutes]

This query monitors the number of bytes observed in the packets stream during the last 60 minutes. Suppose that we want to answer the above query in a continuous fashion by maintaining an SB sample of the k most recent items. Since the sample is sequence-based and the query is time-based, we have to choose k in such a way that the last k items are guaranteed to completely cover the 60-minute range of the query. Clearly, such a choice is impossible if no a-priori knowledge about the data stream is available. But even if we can come up with an upper bound for the number of items in the query range, SB schemes may perform poorly in practice. The reason is that, unless the data stream rate is roughly constant, the average window size is much smaller than the

upper bound, so that (with high probability) the sample contains a large fraction of outdated items not relevant for the query.

The above problems are addressed by time-based (TB) sampling schemes [2, 9]. In such a scheme, the inclusion probability of an item depends on its timestamp rather than on its position in the data stream. In this paper, we focus on sampling schemes that maintain a random sample of a time-based sliding window. For example, a random sample of the packets that arrived during the last 60 minutes can directly be used to approximately answer the query above.

In TB sampling, the main challenge is to realize an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. Space bounds are crucial in data stream management systems with many samples maintained in parallel, since they greatly simplify the task of memory management and avoid unexpected memory shortages at runtime. The difficulty of TB sampling arises from the fact that the number of items in the window may vary significantly over time. If the sampling fraction is kept constant at all times, the sample shrinks and grows with the number of items in the window. The size of the sample is therefore unstable and unbounded. To avoid such behavior, the sampling fraction has to be adapted on the fly. Intuitively, we can afford a large sampling fraction at low stream rates but only a low sampling fraction at high rates.

Uniform sampling schemes, which produce each sample of the same size with equal probability, are the most general of the available sampling schemes. Uniform samples are well understood and, in fact, many statistical estimators require an underlying uniform sample. In the context of data stream systems, uniform sampling is applied either to the data sources directly or to the output of some of the operators in the query graph. It is used to reduce or bound resource consumption, to support ad-hoc querying, to analyze why the system produced a specific output or to optimize the query graph.
Though more efficient techniques exist for some of these applications, uniformity is a must if knowledge about the intended use of the sample is not available at the time the sample is created. This is because uniform samples are not tailored to a specific set of queries but provide a versatile synopsis of the underlying data.

In this paper, we are concerned with sampling schemes that maintain a uniform sample of a time-based sliding window in bounded space. We show that no such sampling scheme can provide hard sample size guarantees. Thus, any bounded-space uniform scheme produces samples of variable size, and sample size guarantees are, if available, probabilistic. We then introduce a novel uniform sampling scheme called bounded priority sampling (BPS). The scheme is based on priority sampling [2] but requires only bounded space. To the best of our knowledge, BPS is the first bounded-space uniform sampling scheme for time-based sliding windows. We analyze the distribution of the sample size and show that the algorithm quickly adapts to changing data rates. By leveraging results from the area of distinct-value estimation [3], we also show how the number of items in the window can be estimated from the sample.

For applications where the uniformity requirement is not crucial, we introduce a more space-efficient stratified sampling scheme [9]. In stratified sampling, the window is partitioned into non-overlapping time intervals called strata. For each stratum, a uniform sample is maintained. Stratified samples are easier to maintain than uniform samples, but estimation becomes more involved. For example, it is not known how to estimate the number of distinct values of an attribute from a stratified sample. If stratified samples can be used, however, estimates may become more precise [9]. In this paper, we discuss the problem of both placing stratum boundaries and maintaining the corresponding samples. We develop a merge-based stratified scheme (MBS), which maintains strata of approximately equal size.
The algorithm merges adjacent strata from time to time; the decision of when to merge and which strata to merge is a major contribution of this paper. In our solution, we treat the problem as an optimization problem and give a dynamic programming algorithm to determine the optimum stratum boundaries.

The remainder of the paper is structured as follows. In Section 2, we review existing techniques from the database and data stream literature. In Section 3, we discuss the problem of maintaining a uniform sample of a time-based sliding window and give the details of the BPS algorithm. Section 4 introduces merge-based stratification, which is then used to maintain strata of approximately equal size. We present the results of our experimental evaluation in Section 5 and conclude the paper in Section 6.

2. EXISTING TECHNIQUES

In this section, we review existing sampling techniques and discuss their applicability for the setting of time-based sliding windows. Since the focus of this paper is on bounded-space sampling schemes, we also analyze the sample size and space consumption of the available schemes.

Database sampling. A variety of sample maintenance techniques have been proposed in the context of relational database systems. The most popular database sampling technique is the reservoir sampling scheme [15]. The scheme maintains a uniform random sample of size k of an insertion-only dataset by intercepting insertion requests on their way to the dataset. The idea is to add the first k inserted items directly to the sample. Subsequent items are accepted into the sample with probability $k/(N+1)$, where $N$ is the number of items processed so far, or ignored otherwise. Accepted items replace a sample item chosen uniformly at random. Reservoir sampling has been extended to support updates and deletions [8, 7], so that one might consider using it for time-based sliding windows. The idea is to treat each arrival as an insertion into the window and each expiration as a deletion from the window. This approach does not work, however, since deletions are explicit in relational databases but implicit in time-based sliding windows.
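The reservoir scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation from [15]; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform size-k sample of an insertion-only stream."""
    rng = rng or random.Random()
    sample = []
    # n is the number of items processed before the current one.
    for n, item in enumerate(stream):
        if n < k:
            # The first k items enter the sample directly.
            sample.append(item)
        elif rng.random() < k / (n + 1):
            # Accept with probability k/(n+1) and overwrite a slot
            # chosen uniformly at random.
            sample[rng.randrange(k)] = item
    return sample
```

The sketch handles insertions only. It has no way to learn that an unsampled item has left a time-based window, which is the obstacle discussed in the text.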
That means that the database sampling schemes need to be aware of every deletion, whether the deleted item is sampled or not, while a window sampling scheme only observes the expiration of sampled items. For this reason, none of the available database sampling schemes can be applied to sliding windows.

Sequence-based sampling. In a sequence-based sampling scheme, the probability of an item being included in the sample depends only on its position in the data stream. Babcock et al. [2] discuss several sampling schemes that maintain a uniform random sample of the last N elements of the stream. In [9], a stratified sampling scheme for the same purpose is given. The idea is to partition the data stream into a set of equally-sized strata and to maintain a reservoir sample

of each non-expired stratum. In Section 4, we apply this idea to time-based sliding windows; the key difference from [9] is that the determination of stratum boundaries becomes much harder. An alternative approach to focus attention on the recent items is to maintain a biased sample [9, 1]. In these schemes, the probability of an item being sampled decays as new items arrive. Again, this sequence-based notion of recency does not match the time-based notion of analysis, so that sequence-based schemes can only be used if a-priori knowledge about the stream is available.

Bernoulli sampling. In [2], a modified version of Bernoulli sampling, which maintains a uniform sample of a sequence-based window, has been proposed. The method can easily be adapted to the setting of time-based windows. Let $q \in (0, 1)$ be the desired sampling rate. In the adapted scheme, each item is included into the sample with probability $q$, independent of the other items, and excluded with probability $1 - q$. Items are removed from the sample if and only if they expire. Suppose that at some arbitrary point in time, the sliding window contains $N$ items. Then, the expected sample size is $qN$ and the actual sample size is close to $qN$ with high probability. The size of the sample therefore grows and shrinks with the number of items in the sliding window. One might hope that it is possible to decrease (increase) $q$ dynamically whenever the sample size gets too large (too small). However, it has been shown recently that such a modification of $q$ destroys the uniformity of the sample [7], so that it is impossible to control the size of a Bernoulli sample.

Priority sampling. The priority sampling scheme [2] maintains a uniform sample of size 1 from a time-based sliding window. Larger samples can be obtained by running multiple priority samplers in parallel. The idea is to assign a random priority between 0 and 1 to each arriving item. At any time, the algorithm reports the item with the highest priority in the window as the sample item. Since each item has the same probability of having the highest priority, the scheme is indeed uniform.
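As an illustration of priority sampling, the following Python sketch keeps only those items that no later arrival beats in priority, which is enough to always report the highest-priority item of the window. The class name `PrioritySampler`, its method names, and the optional explicit `priority` argument (useful for deterministic tests) are assumptions made for this example, not part of [2].

```python
import random
from collections import deque


class PrioritySampler:
    """Size-1 priority sampler over a time-based window of fixed length."""

    def __init__(self, window_length, rng=None):
        self.window_length = window_length
        self.rng = rng or random.Random()
        # Stored as (timestamp, priority, data); timestamps increase and
        # priorities decrease from left to right.
        self.items = deque()

    def _expire(self, now):
        # An item with timestamp ts is in the window iff now - length < ts <= now.
        while self.items and self.items[0][0] <= now - self.window_length:
            self.items.popleft()

    def arrive(self, timestamp, data, priority=None):
        p = self.rng.random() if priority is None else priority
        self._expire(timestamp)
        # Older items with a lower priority can never be reported again.
        while self.items and self.items[-1][1] < p:
            self.items.pop()
        self.items.append((timestamp, p, data))

    def sample(self, now):
        self._expire(now)
        return self.items[0][2] if self.items else None
```

The length of `items` is not bounded by a constant in this sketch, which mirrors the unbounded space consumption of the scheme.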
Footnote 1: This concept is also known as min-wise sampling [11].

In order to be able to always report the highest-priority item, it is both necessary and sufficient to store the items for which there is no element with both a larger timestamp and a higher priority. If, as above, the window contains $N$ items at an arbitrary point in time, the scheme requires $O(\log N)$ space in expectation; the actual space requirement is also $O(\log N)$ with high probability [2]. Thus, the space consumption of priority sampling cannot be bounded from above.

To summarize, none of the available sampling schemes can be used to maintain a random sample from a time-based sliding window in bounded space. We address this situation and introduce both a uniform and a stratified bounded-space sampling scheme.

3. UNIFORM SAMPLING

We model a data stream as an infinite sequence $R = (e_1, e_2, \ldots)$ of items. Each item $e_i$ has the form $(t_i, d_i)$, where $t_i \in \mathbb{R}$ denotes a timestamp and $d_i \in D$ denotes the data associated with the item. The data domain $D$ depends on the application; for example, $D$ might correspond to a finite set of IP addresses or an infinite set of readings from one or more sensors. Throughout the paper, we assume that $t_i < t_j$ for $i < j$, that is, the timestamps of the items are strictly increasing (see footnote 2). Denote by $R(t)$ the set of items from $R$ with a timestamp smaller than or equal to $t$. Denote by $W_\Delta(t) = R(t) \setminus R(t - \Delta)$ a sliding window of length $\Delta$ and denote by $N_\Delta(t) = |W_\Delta(t)|$ the size of the window at time $t$. For brevity, we will suppress the subscript $\Delta$ in the following. Note that we use the term window length to refer to the timespan covered by the window ($\Delta$, fixed) and the term window size to refer to the number of items in the window ($N(t)$, varying).

In this section, we study the problem of maintaining a uniform random sample from $W(t)$ in bounded space. We consider sampling schemes that maintain a data structure from which a uniform random sample $S(t)$ of the items in $W(t)$ can be extracted at any time. The distinction between data structure and sample allows us to examine the space consumption and the sample size separately.
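The window definitions can be made concrete with a small Python sketch. It assumes the stream prefix is given as a list of `(timestamp, data)` pairs; the function names `window` and `window_size` are hypothetical and chosen only for illustration.

```python
def window(items, t, delta):
    """Return W(t): all items whose timestamp ts satisfies t - delta < ts <= t."""
    return [(ts, d) for ts, d in items if t - delta < ts <= t]


def window_size(items, t, delta):
    """Return N(t), the number of items in the window at time t."""
    return len(window(items, t, delta))
```

The window length `delta` stays fixed, whereas `window_size` varies with the arrival rate of the stream.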
A sampling scheme is called uniform if for any $A_1, A_2 \subseteq W(t)$ with $|A_1| = |A_2|$ the probability $P\{S(t) = A_1\}$ that the scheme produces $A_1$ satisfies $P\{S(t) = A_1\} = P\{S(t) = A_2\}$. Thus, the probability that a sampling scheme produces $A_1$ depends only on $|A_1|$ and not on its composition.

Footnote 2: The algorithms in this paper also work when $t_i \le t_j$ for $i < j$, but we will use the stronger assumption $t_i < t_j$ for expository reasons.

3.1 A Negative Result

One might hope that there is a sampling scheme that is able to maintain a fixed-size uniform sample in bounded space. However, such a scheme does not exist.

Theorem 1. Fix some time $t$ and set $N = N(t)$. Then, any algorithm that maintains a fixed-size uniform random sample of size $k$ requires at least $\Omega(k \log N)$ space in expectation.

Proof. Let $A$ be an algorithm that maintains a uniform size-$k$ sample of a time-based sliding window and denote by $W = \{e_{m+1}, \ldots, e_{m+N}\}$ the items in the window at time $t$. Furthermore, denote by $t'_j = t_{m+j} + \Delta$ the point in time when item $e_{m+j}$ expires, $1 \le j < N$, where $\Delta$ is the window length, and set $t'_0 = t$. Now, consider the case where no new items arrive in the stream until all the $N$ items have expired. Then, let $I_j$ be a 0/1-random variable and set $I_j = 1$ if the sample reported by $A$ at time $t'_j$ contains item $e_{m+j+1}$, the oldest item still in the window at that time. Otherwise, set $I_j = 0$. Since $A$ has to store all items it eventually reports, it follows that at time $t'_0$, $A$ stores at least $X = \sum_{j=0}^{N-1} I_j$ items. We have to show that $E[X] = \Omega(k \log N)$. Since $A$ is a uniform sampling scheme, item $e_{m+1}$ is reported at time $t'_0$ with probability $k/N$. At time $t'_1$, only $N - 1$ items remain in the window and item $e_{m+2}$ is reported with probability $k/(N-1)$. The argument can be repeated until, at time $t'_{N-k}$, all the $k$ remaining items are reported by $A$. It follows that

$$P\{I_j = 1\} = \begin{cases} k/(N-j) & 0 \le j < N-k \\ 1 & \text{otherwise} \end{cases} \qquad (1)$$

for $0 \le j < N$. Note that only the marginal probabilities are given in (1); joint probabilities like $P\{I_1 = 1, I_2 = 1\}$

depend on the internals of $A$. By the linearity of expected value, and since $E[I_j] = P\{I_j = 1\}$, we find that

$$E[X] = \sum_{j=0}^{N-1} E[I_j] = k(H_N - H_k + 1) = \Omega(k \ln N),$$

where $H_n = \sum_{i=1}^{n} 1/i = O(\ln n)$ denotes the nth harmonic number.

It follows directly that it is impossible to maintain a fixed-size uniform random sample from a time-based sliding window in bounded space. By Theorem 1, such maintenance requires expected space logarithmic in the window size (which is unbounded); the worst-case space consumption is at least as large. It is not possible either to guarantee a minimum sample size because any algorithm that guarantees a minimum sample size can be used to maintain a sample of size 1. In the light of Theorem 1, we also note that the priority sampling scheme (see Section 2) is asymptotically optimal in terms of expected space. However, the algorithm has a multiplicative overhead of $\ln N$ and therefore a low space efficiency.

3.2 Bounding the Space Consumption

We now develop a bounded-space uniform sampling scheme based on priority sampling (PS). Recall that in priority sampling, a random priority $p_i$ chosen uniformly at random from the unit interval is associated with each item $e_i \in R$. The sample $S(t)$ then consists of the item in $W(t)$ with the largest priority. In addition to the sample item, the scheme stores a set of replacement items, which replace the largest-priority item when it expires. This replacement set consists of all the items for which there is no item with both a larger timestamp and a higher priority.

[Figure 1: Illustration of PS (above timeline) and BPS (below timeline). Legend: sample item, replacement set, candidate item, test item.]

Figure 1 gives an example of the sampling process. A solid black circle represents the arrival of an item; its name and priority are given below and above, respectively. The vertical bars on the timeline indicate the window length, item expirations are indicated by white circles, and double-expirations (see footnote 3) are dotted white circles.
Above the timeline, the current sample item and the set of replacement items are shown. It can be seen that the number of replacement items stored by the algorithm varies over time. In fact, the replacement set is the reason for the unbounded space consumption of the sampling scheme: it contains between 0 and $N(t) - 1$ items and roughly $\ln N(t)$ items on average [2].

Footnote 3: An item that arrived at time $t$ double-expires at time $t + 2\Delta$.

We now describe our bounded-space priority sampling (BPS) scheme. The scheme also assigns random priorities to arriving items but stores at most two items in memory: a candidate item from $W(t)$ and a test item from $W(t - \Delta)$. The test item is used to determine whether or not the candidate item is reported as a sample item; see the discussion below. The maintenance of these two items is as follows:

a) Arrival of item $e_i$. If there is currently no candidate item or if the priority of $e_i$ is larger than the priority of the candidate item, $e_i$ becomes the new candidate item and the old candidate is discarded. Otherwise, the arriving item is ignored.

b) Expiration of candidate item. The expired candidate becomes the test item; we only store the timestamp and the priority of the test item. There is no candidate item until the next item arrives in the stream.

c) Double-expiration of test item. The test item is discarded.

The above algorithm maintains the following invariant: the candidate item always equals the highest-priority item that has arrived in the stream since the expiration of the former candidate item. This might or might not coincide with the highest-priority item in the current window, and we use the test item to distinguish between these two cases. Suppose that at some time, the candidate item expires and becomes the test item. Then the candidate must have been the highest-priority item in the window right before its expiration. (If there were an item with a higher priority, this item would have replaced the candidate.) It follows that whenever the candidate item has a higher priority than the current test item, we know that the candidate is the highest-priority item since the arrival of the test item and therefore since the start of the current window.
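The maintenance rules a) to c) can be sketched in Python as follows. The sketch reports the candidate only when there is no test item or when the candidate has a higher priority than the test item, and reports nothing otherwise. The class name, the lazy processing of expirations on each call, and the optional explicit `priority` argument are assumptions made for illustration; timestamps are assumed to be strictly increasing.

```python
import random


class BoundedPrioritySampler:
    """Single-item BPS sketch: stores at most one candidate and one test item."""

    def __init__(self, window_length, rng=None):
        self.window_length = window_length
        self.rng = rng or random.Random()
        self.candidate = None  # (timestamp, priority, data)
        self.test = None       # (timestamp, priority) only

    def _advance(self, now):
        d = self.window_length
        # b) An expired candidate becomes the test item.
        if self.candidate is not None and self.candidate[0] <= now - d:
            self.test = self.candidate[:2]
            self.candidate = None
        # c) A double-expired test item is discarded.
        if self.test is not None and self.test[0] <= now - 2 * d:
            self.test = None

    def arrive(self, timestamp, data, priority=None):
        p = self.rng.random() if priority is None else priority
        self._advance(timestamp)
        # a) Keep the arriving item only if it beats the current candidate.
        if self.candidate is None or p > self.candidate[1]:
            self.candidate = (timestamp, p, data)

    def sample(self, now):
        self._advance(now)
        if self.candidate is None:
            return None
        if self.test is None or self.candidate[1] > self.test[1]:
            return self.candidate[2]
        return None  # cannot tell whether the candidate is the window maximum
```

Handling expirations lazily is a simplification of this sketch. It yields the same state as processing each expiration at its exact time, because every arrival first brings the state up to date.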
Similarly, whenever there is no test item stored by BPS, there has not been an expiration of a candidate item for at least one window length, so that the candidate also equals the highest-priority item in the window. In both cases, we report the candidate as a sample item. Otherwise, if the candidate item has a lower priority than the test item, we have no means to detect whether or not the candidate equals the highest-priority item in the window and no sample item is reported.

Before we assert the correctness of BPS and analyze its properties, we give an example of the sampling process in Figure 1. The current candidate item and test item are shown below the timeline. If the candidate item is shaded, it is reported as a sample item; otherwise, no sample item is reported. The letters below the BPS data structure refer to cases a), b) and c) above. As long as no expiration occurs, the candidate stored by BPS equals the highest-priority item in the window and is therefore reported as a sample item. The situation changes as B expires. BPS then makes item B the test item and, because there is no candidate item anymore, fails to report a sample item. This failure can be seen as a consequence of Theorem 1: BPS is a bounded-space sampling scheme and thus cannot guarantee a fixed sample size. Item F becomes the new candidate item upon its arrival. However, F is not reported because its priority is lower than the priority of the test item B. And in fact, not F but C is the highest-priority item in the window at this time. Later C expires and F does become the highest-priority item in the window. However, we still do not report

F since we are not aware of this situation. As G arrives, however, we report a sample item again because G has a higher priority than the test item B. Finally, item B is discarded from the BPS data structure as it double-expires.

3.3 Correctness and Analysis

We now establish the correctness of the BPS algorithm. Recall that BPS produces either an empty sample or a single-item sample. Given that BPS does produce a sample item, we have to show that this item is chosen uniformly at random from the items in the current sliding window.

Theorem 2. BPS is a uniform sampling scheme, that is, for any $e_j \in W(t)$, we have

$$P\{S(t) = \{e_j\} \mid |S(t)| = 1\} = 1/N(t).$$

Proof. Fix some time $t$ and set $S = S(t)$. Denote by $e_{\max}$ the highest-priority item in $W(t)$ and suppose that $e_{\max}$ has priority $p_{\max}$. Furthermore, denote by $e' \in W(t - \Delta)$ the candidate item stored in the BPS data structure at time $t - \Delta$ (if there is one) and let $p'$ be the priority of $e'$. Note that both $e_{\max}$ and $e'$ are random variables. There are 3 cases.

Case 1: There is no candidate item at time $t - \Delta$. Then at time $t$, $e_{\max}$ is the candidate item and there is no test item. We have $S = \{e_{\max}\}$.

Case 2: Item $e'$ has a smaller priority than $e_{\max}$. Then $e_{\max}$ is the candidate item at time $t$ and, depending on whether $e'$ expired before or after the arrival of $e_{\max}$, the test item is either equal to $e'$ or empty. In both cases, we have $S = \{e_{\max}\}$.

Case 3: Item $e'$ has a higher priority than $e_{\max}$. Then, $e'$ is still the candidate item at the time of its expiration, since there is no higher-priority item in $W(t)$ that might have replaced $e'$. Thus, item $e'$ becomes the test item upon its expiration and continues to be the test item up to time $t$; it double-expires somewhere in the interval $(t, t + \Delta)$. It follows that no item is reported at time $t$, so that $S = \emptyset$, because the priority of the candidate item ($\le p_{\max}$) is lower than the priority $p'$ of the test item.

To summarize, we have

$$S = \begin{cases} \{e_{\max}\} & \text{no candidate item at time } t - \Delta \\ \{e_{\max}\} & p_{\max} > p' \\ \emptyset & \text{otherwise.} \end{cases} \qquad (2)$$

Uniformity now follows since (2) does not depend on the values, timestamps or order of the individual items in $W(t)$.
For any $e_j \in W(t)$, we have

$$P\{S = \{e_j\} \mid |S| = 1\} = P\{e_j = e_{\max}\} = 1/N(t)$$

and the theorem follows.

We now analyze the sample size of the BPS scheme. Clearly, the sample size is probabilistic and its exact distribution depends on the entire history of the data stream. However, in the light of Theorem 3 below, it becomes evident that we can still provide a local lower bound on the probability that the scheme produces a sample item. The lower bound is local because it changes over time; we cannot guarantee a global lower bound other than 0 that holds at any arbitrary time without a-priori knowledge of the data stream.

Theorem 3. The probability that BPS succeeds in producing a sample item at time $t$ is bounded from below by

$$P\{|S(t)| = 1\} \ge \frac{N(t)}{N(t - \Delta) + N(t)}.$$

Proof. BPS produces a sample item if the highest-priority item $e_{\max} \in W(t)$ has a higher priority than the candidate item $e'$ stored in the BPS data structure right before the start of $W(t)$; see (2) above. In the worst case, $e'$ equals the highest-priority item in $W(t - \Delta)$. Now suppose that we order the items in $W(t - \Delta) \cup W(t)$ in descending order of their priorities. BPS succeeds for sure if the first of the ordered items is an element of $W(t)$. Since the priorities are independent and identically distributed, this event occurs with probability $N(t)/(N(t - \Delta) + N(t))$ and the assertion of the theorem follows.

If the arrival rate of the items in the data stream is constant so that $N(t) = N(t - \Delta)$, BPS succeeds with a probability of at least 50%. If the rate increases or decreases, the success probability will also increase or decrease, respectively.

3.4 Sampling Multiple Items

The BPS scheme as given above can be used to maintain a single-item sample. A straightforward way to obtain larger samples is to run $k$ independent BPS samplers $S_1, \ldots, S_k$ in parallel; we refer to this scheme as BPS with replacement (BPSWR). The sample is then set to $S = S_1 \uplus \cdots \uplus S_k$. We have

$$E[|S|] = \sum_{i=1}^{k} P\{|S_i| = 1\} \ge k \, \frac{N(t)}{N(t - \Delta) + N(t)}$$

by the linearity of the expected value. However, this approach has two major drawbacks.
First, the sample $S$ is a with-replacement sample, that is, each item in the window may be sampled more than once. The net sample size after duplicate removal might therefore be smaller than $|S|$. Second and more importantly, the maintenance of the $k$ independent samples is expensive. Since a single copy of the BPS data structure requires constant time per arriving item, the per-item processing time is $O(k)$ and the total time to process a window of size $N$ is $O(kN)$. If $k$ is large, the overhead to maintain the sample can be significant.

We now develop a without-replacement sampling scheme called BPSWOR. In general, without-replacement samples are preferable since they contain more information about the data. The scheme is as follows: we modify BPS so as to store $k$ candidates and $k$ test items simultaneously. Denote by $S_{\text{cand}}$ the set of candidates and by $S_{\text{test}}$ the set of test items. The sampling process is similar to BPS: an arriving item $e$ becomes a candidate when either $|S_{\text{cand}}| < k$ or $e$ has a higher priority than the lowest-priority item in $S_{\text{cand}}$. In the latter case, the lowest-priority item is discarded in favor of $e$. As before, expiring candidates become test items and double-expiring test items are discarded. The sample $S(t)$ is then given by

$$S(t) = \text{top-}k\bigl(S_{\text{cand}}(t) \cup S_{\text{test}}(t)\bigr) \cap S_{\text{cand}}(t),$$

where top-$k(A)$ determines the items in $A$ with the $k$ highest priorities. Note that for $k = 1$, BPSWR and BPSWOR coincide. $S(t)$ is then a uniform random sample of $W(t)$ without replacement; the proof is similar to the proof of

Theorem 2. Also, using an argument as in the proof of Theorem 3, we can show that $E[|S(t)|] \ge kN(t)/(N(t - \Delta) + N(t))$. Thus, BPSWR and BPSWOR have the same lower bound on the expected (gross) sample size.

The cost of processing a window of size $N$ is $O(kN)$ if the candidates are stored in a simple array. A more efficient approach, which also improves the cost in comparison to BPSWR, is to store the candidates in a treap, where the items are arranged in order with respect to the timestamps and in heap-order with respect to the priorities. The cost of BPSWOR then decreases to $O(N + k \log k \log N)$ in expectation (see footnote 4).

Note that we can also modify PS to sample without replacement. The so-modified PSWOR scheme then reports the items with the $k$ highest priorities in the window. In order to maintain these $k$ items incrementally, we store each item as long as there are fewer than $k$ more recent items with a higher priority. The space consumption is still $O(k \log N)$ in expectation, but efficient maintenance of the replacement set becomes challenging. Since the focus of this paper is on bounded-space sampling schemes, we do not further elaborate on this issue.

3.5 Estimation of Window Size

For some applications, it is important to be able to estimate the window size in order to make effective use of the sample. For example, the window sum of an attribute is typically estimated as the sample average of the respective attribute multiplied by the window size. Thus, in some applications, knowledge of the window size is important to determine scale-up factors.

Exact maintenance of the number of items in the window requires that we store all the timestamps in the window in order to deal with expirations. Typically, this approach is infeasible in practice. Approximate data structures [6] do exist and can be leveraged to support the sampling process. If such alternate data structures are unavailable, we can come up with an estimate of the window size directly from the sample. Set $W_2(t) = W(t - \Delta) \cup W(t)$ and denote by $p_{(k)}$ the priority of the item with the kth highest priority in $W_2(t)$.
In [3], it has been shown that an unbiased estimator for $N(t)$ is given by

$$\hat{N}_W(t) = \frac{|W(t) \cap \text{top-}k(W_2(t))|}{k} \cdot \frac{k-1}{1 - p_{(k)}}.$$

Here, the first factor estimates the fraction of non-expired items in $W_2(t)$ from the top-$k$ items (which can be viewed as a random sample of $W_2$), while the second factor is an estimate of $|W_2(t)|$ itself. Now, suppose that we maintain the sample using BPSWOR. Set $S_2(t) = S_{\text{cand}} \cup S_{\text{test}}$ and denote by $p_{(k)}$ the priority of the item with the kth highest priority in $S_2$. Consider the estimator

$$\hat{N}_S(t) = \frac{|S(t) \cap \text{top-}k(S_2(t))|}{k} \cdot \frac{k-1}{1 - p_{(k)}}.$$

This estimator is similar to $\hat{N}_W(t)$ but solely accesses information available in the sample. Both estimators coincide if and only if top-$k(S_2(t))$ = top-$k(W_2(t))$. This happens if at least $|W(t - \Delta) \cap \text{top-}k(W_2(t))|$ items have been reported as the sample at time $t - \Delta$. Otherwise, the first factor in $\hat{N}_S(t)$ will overestimate the first factor in $\hat{N}_W(t)$, while the second factor will underestimate the respective factor in $\hat{N}_W(t)$. In our experiments, we found that the estimator $\hat{N}_S$ has negligible bias and low variance. Thus, both over- and underestimation seem to balance smoothly, though we do not make any formal claims here.

Footnote 4: Following an argument as in [3], at most $O(k \log N)$ items of the window are accepted into the candidate set in expectation and each accepted item incurs an expected cost of $O(\log k)$ [13]. At most $k$ items (double-)expire while processing a window, so that the expected cost to process (double-)expirations is $O(k \log k)$.

4. STRATIFIED SAMPLING

We now consider the problem of maintaining a stratified sample of a time-based sliding window. The general idea is to partition the window into disjoint strata and to maintain a uniform sample of each stratum [9]. Stratified sampling is often superior to uniform sampling because a stratified scheme exploits correlations between time and the quantity of interest. As will become evident later on, stratification also allows us to maintain larger samples than with BPS in the same space.
The main drawback of stratified sampling is its limited applicability; for some problems, it is difficult or even impossible to compute a global solution from the different subsamples. For example, it is not known how the number of distinct values can be estimated from a stratified sample, while the problem has been studied extensively for uniform samples [5]. If, however, the desired analytical tasks can be performed on a stratified sample, stratification is often the method of choice.

We consider stratified sampling schemes, which partition the window into $l > 1$ strata and maintain a uniform sample $S_i$ of each stratum, $1 \le i \le l$. Each sample has a fixed size of $n$ items. In addition to the sample, we also store the stratum size $N_i$ and the timestamp $t_i$ of the upper stratum boundary; these two quantities are required for sample maintenance. The main challenge in stratified sampling is the placement of stratum boundaries because they have a significant impact on the quality of the sample (see footnote 5). In the simplest version, the stream is divided into strata of equal width (time intervals); we refer to this strategy as equi-width stratification. An alternative strategy is equi-depth stratification, where the window is partitioned into strata of equal size (number of items). Equi-depth stratification outperforms equi-width stratification when the arrival rate of the data stream varies inside a window, but the strata are much more difficult to maintain. In fact, perfect equi-depth stratification is impossible (see below), so that approximate solutions are needed. In this section, we develop a merge-based stratification strategy, which approximates equi-depth stratification to the best possible extent.

Figure 2 illustrates equi-width stratification with parameters $l = 4$ and $n = 1$; sampled items are represented by solid black circles. The figure displays a snapshot of the sample at 3 different points in time, which are arranged vertically and termed a), b) and c). Note that the rightmost stratum ends at the right window boundary and grows as new items arrive, while the leftmost stratum exceeds the window and may contain expired items.
The maintenance of the stratified sample is significantly simpler than the maintenance of a uniform sample because arrivals and expirations are not intermixed within strata. Arriving items are added to the rightmost stratum and, since no expirations can occur there, we can use reservoir sampling to maintain the sample incrementally (see Section 2). In contrast, expirations only affect the leftmost stratum. We remove expired items from the respective sample; the remaining sample still represents a uniform sample of the non-expired part of the stratum [8].

5 To see this, consider the simple case where all items in the window fall into only one of the $l$ strata. In this case, a fraction of $100(l-1)/l\,\%$ of the available space remains unused.

Figure 2: Equi-width stratification (snapshots a), b) and c) of the stream items A to K; legend: window, stratum boundary, sampled item; annotations: expiration, 1st stratum)

4.1 Effect of Stratum Sizes

The main advantage of equi-width stratification is its simplicity; the main disadvantage is that the sampling fraction may vary widely across the strata. In the example of Figure 2c), the sampling fractions of the first, second and third stratum are given by 50%, 100% and 16%, respectively. In general, dense regions of the stream are underrepresented by an equi-width sample, while sparse regions are overrepresented. Thus, we want to stratify the data stream in such a way that each stratum has approximately the same size and therefore the same sampling fraction; we refer to this approach as equi-depth stratification. Unfortunately, perfect equi-depth stratification is not realizable in practice because the data stream is unknown in advance and we cannot move stratum boundaries arbitrarily.

Before we introduce our approximate merge-based algorithm, we discuss the relationship of stratum sizes and accuracy with the help of a simple example. Suppose that we want to estimate the window average $\mu$ of some attribute of the stream from a stratified sample and assume for simplicity that the respective attribute is normally distributed with mean $\mu$ and variance $\sigma^2$. Further suppose that at some time the window contains $N$ items and is divided into $l$ strata of sizes $N_1, \dots, N_l$ with $\sum_i N_i = N$. Then, the standard Horvitz-Thompson estimator $\hat\mu$ of $\mu$ is a weighted average of the per-stratum sample averages [12], that is,

$$\hat\mu = \frac{1}{N} \sum_{i=1}^{l} N_i \hat\mu_i,$$

where $\hat\mu_i$ is the sample average of the $i$th stratum.
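The weighted average above can be computed directly from the stored stratum sizes and the per-stratum samples. The following is a minimal sketch; the function name `stratified_mean` and the list-of-pairs input format are assumptions made for illustration only.

```python
def stratified_mean(strata):
    """Weighted average of per-stratum sample means.

    strata: list of (stratum_size, sample_values) pairs; every sample is
            assumed to be a non-empty uniform sample of its stratum.
    """
    total_size = sum(size for size, _ in strata)
    weighted_sum = 0.0
    for size, sample in strata:
        sample_mean = sum(sample) / len(sample)
        # Each stratum contributes in proportion to its size N_i.
        weighted_sum += size * sample_mean
    return weighted_sum / total_size


# Two strata with sizes 2 and 6 and hypothetical sampled values.
mu_hat = stratified_mean([(2, [10.0]), (6, [2.0, 4.0])])
```

Because the weights are the stratum sizes rather than the sample sizes, a small but densely sampled stratum does not dominate the estimate.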
The estimator has variance

$$\mathrm{Var}[\hat\mu] = \frac{1}{N^2} \sum_{i=1}^{l} N_i^2\, \mathrm{Var}[\hat\mu_i] = \frac{\sigma^2}{nN^2} \sum_{i=1}^{l} N_i^2,$$

where we used $\mathrm{Var}[\hat\mu_i] = \sigma^2/n$. Thus, the variance of the estimator is proportional to the sum of the squares of the stratum sizes, or similarly, the variance of the stratum sizes:

$$\mathrm{Var}[N_1, \dots, N_l] = \frac{1}{l} \sum_{i=1}^{l} \left( N_i - \frac{N}{l} \right)^2 = \frac{\sum_{i=1}^{l} N_i^2}{l} - \left( \frac{N}{l} \right)^2 \qquad (3)$$

The variance is minimized if all strata have the same size (best case) and maximized if one stratum contains all the items in the window (worst case). The above example is extremely simplified because we designed the stream in such a way that the variance $\mathrm{Var}[\hat\mu_i]$ of the estimate is equal in all strata. In general, stratification becomes more efficient as the correlation of the attribute of interest with time increases (because time is the stratification variable). In this paper, however, we assume that no information about the intended use of the sample is available; in this case, our best guess is to assume equal variance in each stratum. Thus, the variance of the stratum sizes as given in (3) can be used to quantify the quality of a given stratification.

4.2 Merge-Based Stratification

Perfect equi-depth stratification is impossible, since we cannot reposition stratum boundaries arbitrarily. To see this, consider the state of the sample as given in Figure 2c). To achieve equi-depth stratification, we would have to (1) remove the stratum boundary between items D and E, and (2) introduce a new stratum boundary between H and I. Here, (1) represents a merge of the first and second stratum. In [4], Brown et al. have shown that such a merge is possible, that is, a sample of the merged stratum can be computed from the samples of the individual strata. In the example, the merged sample would contain item C with probability 2/3 and item E with probability 1/3. In contrast, (2) represents a split of the third stratum into two new strata, one containing items F-H and one containing items I-K. In the case of a split, it is neither possible to compute the samples of the two new strata nor to determine the stratum sizes. In the example, prior to the split, the third stratum has size 6 and the sample contains item I.
Based on this information, it is impossible to come up with a sample of stratum F-H; we cannot even determine that stratum F-H contains 3 items.

Our merge-based stratified sampling scheme (MBS) approximates equi-depth stratification to the extent possible. The main idea is to merge two adjacent strata from time to time. Such a merge reduces the information stored about the two strata but creates free space at the end of the sample, which can be used for future items. In Figure 3, we illustrate MBS on the example data stream. We start as before with the 4 strata given in a). Right after the arrival of item H, we merge stratum C-D with stratum E to obtain stratum C-E. The decision of when and which strata to merge is the major challenge of the algorithm. After a merge, we use the freed space to start a new, initially empty stratum. The state of the sample after the creation of the new stratum is shown in b). Subsequent arrivals are added to the new stratum (items I, J and K). Finally, stratum A-B expires and, again, a fresh stratum is created; see c). Note that the sample is much more balanced than with equi-width stratification (Figure 2).

Figure 3: Merge-based stratification (snapshots a), b) and c) of the stream items A to K; new strata are created through merging and through expiration)

Before we discuss when to merge, we briefly describe how to merge. Suppose that we want to merge two adjacent strata $R_1$ and $R_2$ with $|R_1|, |R_2| \ge n$. Denote by $S_i$, $N_i$, $t_i$ the uniform sample (of size $n$), the stratum size and the upper boundary of stratum $R_i$, $i \in \{1, 2\}$. Then, the merged stratum $R = R_1 \cup R_2$ has size $N_1 + N_2$ and upper boundary $t_2$. In [4], Brown et al. have shown how to merge $S_1$ and $S_2$ to obtain a uniform sample $S$ of $R_1 \cup R_2$ with $|S| = n$. Let $X$ be a random variable for the number of items from $R_1$ in a size-$n$ uniform sample drawn directly from $R$. $X$ is hypergeometrically distributed with

$$P\{X = x\} = \binom{N_1}{x} \binom{N_2}{n - x} \bigg/ \binom{N_1 + N_2}{n}$$

for $0 \le x \le n$. Since all the distribution parameters are known, we can obtain a realization $x$ of $X$ by throwing a die. Then, we compute uniform subsamples $S'_1$ and $S'_2$ from $S_1$ and $S_2$, respectively, with $|S'_1| = x$ and $|S'_2| = n - x$. The subsamples can be computed using reservoir sampling, though more efficient sampling schemes exist for this purpose [14]. The final sample $S$ is then set to the union of $S'_1$ and $S'_2$; see [4] for a proof of the uniformity of $S$.

4.3 When To Merge Which Strata

The decision of when and which strata to merge is crucial for merge-based stratification. Suppose that at some time, the window is divided into $l$ strata $R_1, \dots, R_l$ of size $N_1, \dots, N_l$, respectively. During the subsequent sampling process, a new stratum is created when either (1) stratum $R_1$ expires or (2) two adjacent strata are merged. Observe that we have no influence on (1), but we can apply (2) as needed. We now treat the problem of when and which strata to merge as an optimization problem, where the optimization goal is to minimize the variance of the stratum sizes at the time of the expiration of $R_1$. Therefore, whenever the first stratum expires, the sample looks as much like an equi-depth sample as possible. Denote by $R^+ = \{e_1, \dots, e_{N^+}\}$ the set of items that arrive until the expiration of stratum $R_1$ (but have not yet arrived) and set $N^+ = |R^+|$ (see footnote 6). At the time of $R_1$'s expiration and before the creation of the new stratum, the window is divided into $l - 1$ strata so that there are $l - 2$ inner stratum boundaries. The positions of the stratum boundaries depend on both the number and point in time of any merges we perform. Our algorithm rests on the observation that for any way of putting $l - 2$ stratum boundaries in the sequence $R_2, R_3, \dots, R_l, e_1, e_2, \dots, e_{N^+}$, there is at least one corresponding sequence of merges that results in the respective stratification.
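The merge procedure of Section 4.2 can be sketched in a few lines. This is an illustration only: the function name `merge_samples` is hypothetical, the hypergeometric draw is simulated by sampling positions without replacement, and Python's `random.sample` is substituted for the reservoir-based subsampling mentioned in the text.

```python
import random


def merge_samples(s1, n1, s2, n2, rng=random):
    """Merge uniform size-n samples s1, s2 of two disjoint strata.

    n1, n2 are the stratum sizes N_1 and N_2. Returns the merged sample
    (again of size n) and the size of the merged stratum.
    """
    n = len(s1)
    if len(s2) != n or n1 < n or n2 < n:
        raise ValueError("both samples need size n and both strata >= n items")
    # Realization x of X: draw n of the N_1 + N_2 positions without
    # replacement and count how many fall into the first stratum.
    x = sum(1 for pos in rng.sample(range(n1 + n2), n) if pos < n1)
    # Uniform subsamples of sizes x and n - x; their union is the result.
    merged = rng.sample(s1, x) + rng.sample(s2, n - x)
    return merged, n1 + n2


# The example from the text: stratum C-D (sample C) merged with stratum E.
rng = random.Random(7)
trials = 3000
hits = sum(merge_samples(["C"], 2, ["E"], 1, rng)[0] == ["C"] for _ in range(trials))
fraction_c = hits / trials  # should be close to 2/3
```

The simulated frequency of item C approaches the probability 2/3 stated for the example.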
For example, the stratification

$$R_2 \mid R_3 \mid \cdots \mid R_l, e_1, \dots, e_{N^+}$$

is achieved if no merge is performed (vertical bars denote boundaries), while

$$R_2 \mid \cdots \mid R_i, R_{i+1} \mid \cdots \mid R_l, e_1, \dots, e_j \mid e_{j+1}, \dots, e_{N^+}$$

is achieved if stratum $R_i$ and $R_{i+1}$ are merged after the arrival of item $e_j$ and before the arrival of item $e_{j+1}$. In general, for every stratum boundary in between $R_l, e_1, \dots, e_{N^+}$, we drop a stratum boundary in between $R_2, \dots, R_l$ by performing a merge operation at the respective point in time.

6 In practice, $N^+$ is not known in advance; we address this issue in Section 4.4.

We can now reformulate the optimization problem: Find the partitioning of the integers

$$N_2, \dots, N_l, \underbrace{1, \dots, 1}_{N^+ \text{ times}}$$

into $l - 1$ consecutive and non-empty partitions so that the variance (or sum of squares) of the intra-partition sums is minimized. The problem can be solved using dynamic programming in $O(l(l + N^+)^2)$ time [10]. In our specific instance of the problem, however, the last $N^+$ values of the sequence of integers are all equal to 1. As shown below, we can leverage this fact to construct a dynamic programming algorithm that obtains an optimum solution in only $O(l^3)$ time. Since $N^+$ is typically large, the improvement in performance can be significant.

The algorithm is as follows. Let $\mathrm{opt}(k, i)$ be the minimum sum of squares when $k$ of the $l - 2$ boundaries are placed between $N_2, \dots, N_l$ and the last one of these $k$ boundaries is placed right after $N_i$; $0 \le k \le l - 2$ and $k < i < l$. Then, $\mathrm{opt}(k, i)$ can be decomposed into two functions

$$\mathrm{opt}(k, i) = f(k, i) + g(k, i),$$

where $f(k, i)$ is the minimum sum of squares for the $k$ partitions left of and including $N_i$ and $g(k, i)$ is the minimum sum of squares for the $l - k - 1$ partitions right of $N_i$. The decomposition significantly reduces the complexity because the computation of $g$ does not involve any optimization. To define $g(k, i)$, observe that by definition, there are no boundaries in between $N_{i+1}, \dots, N_l$, so that these values fall into a single partition and we can sum them up.
The resulting part of the integer sequence is then

$$N_{i+1,l}, 1, \dots, 1,$$

where $N_{a,b} = \sum_{j=a}^{b} N_j$ (see footnote 7). In fact, $g$ is minimized if all the $l - k - 1$ partitions have the same size, that is, size

$$\frac{N_{i+1,l} + N^+}{l - k - 1}.$$

If $N_{i+1,l}$ is larger than this average size, the minimum value of $g$ cannot be obtained. In this case, the best choice is to put $N_{i+1,l}$ in a stratum of its own; the remaining $l - k - 2$ partitions then all have size $N^+/(l - k - 2)$. Thus, the function $g$ is given by

$$g(k, i) = \begin{cases} (l - k - 1) \left( \dfrac{N_{i+1,l} + N^+}{l - k - 1} \right)^2 & \text{if } N_{i+1,l} < \dfrac{N_{i+1,l} + N^+}{l - k - 1}, \\[2ex] N_{i+1,l}^2 + (l - k - 2) \left( \dfrac{N^+}{l - k - 2} \right)^2 & \text{otherwise.} \end{cases}$$

The function $f$ can be defined recursively with

$$f(0, i) = N_{2,i}^2, \qquad f(k, i) = \min_{k \le j < i} \left[ f(k - 1, j) + N_{j+1,i}^2 \right],$$

and the optimum solution is given by

$$\min_{0 \le k \le l - 2} \; \min_{k < i < l} \; \mathrm{opt}(k, i).$$

7 $N_{a,b}$ can be computed in constant time with the help of an array containing the prefix sums $N_{2,2}, \dots, N_{2,l}$ [10].
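For reference, the partitioning problem itself can be solved with the generic dynamic program over the full integer sequence, i.e., the slower baseline attributed to [10] rather than the $O(l^3)$ refinement described above. The sketch below is such a baseline under stated assumptions: the function name `best_partition` and the return format are hypothetical, and the running time grows quadratically in the sequence length, which is exactly why the refinement is attractive when $N^+$ is large.

```python
def best_partition(values, parts):
    """Split `values` into `parts` consecutive non-empty groups so that the
    sum of squared group sums is minimal.

    Returns (cost, ends), where ends[p] is the exclusive end index of group p.
    """
    m = len(values)
    if not 1 <= parts <= m:
        raise ValueError("need 1 <= parts <= len(values)")
    # Prefix sums give every group sum in constant time.
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    inf = float("inf")
    # cost[p][e]: minimal cost of splitting the first e values into p groups.
    cost = [[inf] * (m + 1) for _ in range(parts + 1)]
    back = [[0] * (m + 1) for _ in range(parts + 1)]
    cost[0][0] = 0
    for p in range(1, parts + 1):
        for e in range(p, m + 1):
            for s in range(p - 1, e):  # last group covers values[s:e]
                c = cost[p - 1][s] + (prefix[e] - prefix[s]) ** 2
                if c < cost[p][e]:
                    cost[p][e] = c
                    back[p][e] = s
    # Walk the back pointers to recover the group boundaries.
    ends, e = [], m
    for p in range(parts, 0, -1):
        ends.append(e)
        e = back[p][e]
    return cost[parts][m], ends[::-1]


# Hypothetical instance: N_2..N_4 = 2, 1, 6 and N+ = 3 future items, l = 4.
sizes, n_plus = [2, 1, 6], 3
cost, ends = best_partition(sizes + [1] * n_plus, parts=len(sizes))
```

In this instance the best stratification groups the first two strata, keeps the third one separate and collects the three future items in a stratum of their own, which corresponds to merging the first two remaining strata right away.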

To compute the optimum solution, we iterate over $k$ in increasing order and memoize the values of $f(k, \cdot)$; these values will be reused for the computation of $f(k + 1, \cdot)$. The global solution and the corresponding stratum boundaries are tracked during the process. Since each of the loop variables $k$, $i$ and $j$ takes at most $l$ different values, the total time complexity is $O(l^3)$. The algorithm requires $O(l)$ space.

4.4 Estimation of Arriving-Item Count

The decision of when to merge is dependent on the number $N^+$ of items that arrive until the expiration of the first stratum. In practice, $N^+$ is unknown and has to be estimated. In this section, we propose a simple and fast-to-compute estimator for $N^+$. Especially for bursty data streams, estimation errors can occur; we therefore discuss how to make MBS robust against estimation errors.

As before, suppose that at some time $t$ the sample consists of $l$ strata of sizes $N_1, \dots, N_l$ and denote by $t_i$ the upper boundary of the $i$th stratum, $1 \le i \le l$. Furthermore, denote by $\delta = t_1 + \Delta - t$ the time span until the expiration of the first stratum. We want to predict the number of items that arrive until time $t + \delta$. Denote by $j$ the stratum index such that $t - t_j > \delta$ and $t - t_{j+1} \le \delta$. An estimate $\hat N^+$ of $N^+$ is then given by

$$\hat N^+ = \frac{\delta}{t - t_j} \sum_{i=j+1}^{l} N_i.$$

The estimate roughly equals the number of items that arrived in the last $\delta$ time units. The intuition behind this estimator is that the amount of history we use for estimation depends on how far we want to extrapolate into the future. In conjunction with the robustness techniques discussed below, this approach showed a good performance in our experiments.

Whenever a stratum expires, we compute the estimate $\hat N^+$ and, based on this estimate, determine the optimum sequence of merges using the algorithm given in Section 4.3. Denote by $\hat m \ge 0$ the total number of merges in the resulting sequence and by $\hat N^+_1$ the number of items that arrive before the first merge. In general, we now wait for $\hat N^+_1$ items to arrive in the stream and then perform a merge operation.
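The estimator for the number of arriving items can be sketched as follows. Several details are assumptions made for this illustration: the function name `estimate_arrivals` is hypothetical, the estimator is taken to scale the items observed since the boundary of stratum $j$ by the ratio of the look-ahead span to the elapsed time since that boundary, and the fallback used when no boundary lies far enough in the past (the whole window is used as history) is our own choice.

```python
def estimate_arrivals(t, window, boundaries, sizes):
    """Estimate the number of items arriving until the first stratum expires.

    t:          current time
    window:     window length
    boundaries: upper boundaries t_1 <= ... <= t_l of the strata
    sizes:      stratum sizes N_1, ..., N_l
    """
    delta = boundaries[0] + window - t  # time until the first stratum expires
    if delta <= 0:
        raise ValueError("first stratum has already expired")
    # Latest boundary that lies more than delta time units in the past.
    j = None
    for idx, b in enumerate(boundaries):
        if t - b > delta:
            j = idx
    if j is None:
        # Assumption: no such boundary, so use the whole window as history.
        span, count = window, sum(sizes)
    else:
        span, count = t - boundaries[j], sum(sizes[j + 1:])
    # Scale the observed count to a time span of length delta.
    return count * delta / span


n_plus_hat = estimate_arrivals(
    t=100, window=60, boundaries=[50, 70, 85, 100], sizes=[30, 20, 40, 10]
)
```

In the usage example, the first stratum expires in 10 time units, the latest boundary more than 10 time units in the past is 85, and the 10 items that arrived since then are scaled by 10/15.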
Note that the value of $\hat m$ ($\hat N^+_1$) is a monotonically increasing (decreasing) function of $\hat N^+$; the more items arrive before the expiration of the first stratum, the more merges we perform. Thus, underestimation may lead to too few merges and overestimation may lead to too many merges. To make MBS robust against estimation errors, we recompute the sequence of merges whenever we observe that the data stream behaves differently than predicted. There are two cases:

$\hat m = 0$: We recompute $\hat m$ and $\hat N^+_1$ only if more than $\hat N^+$ items arrive in the stream, so that a merge may become profitable. This strategy is optimal if $\hat N^+ \ge N^+$ but might otherwise lead to a tardy merge.

$\hat m > 0$: Denote by $\hat\delta = (\hat N^+_1 / \hat N^+)\,\delta$ the estimated time span until the arrival of the $\hat N^+_1$-th item. We recompute the estimates if the $\hat N^+_1$-th item does not arrive close to time $t + \hat\delta$. For concreteness, recomputation is triggered if either the $\hat N^+_1$-th item arrives before time $t + (1 - \epsilon)\hat\delta$ or fewer than $\hat N^+_1$ items have arrived at time $t + (1 + \epsilon)\hat\delta$, where $0 < \epsilon < 1$ determines the validity interval of the estimate and is usually set to a small value, say 5%.

In our experiments, the variance of the stratum sizes achieved by MBS without a-priori knowledge of $N^+$ was almost as low as the one achieved by MBS with a-priori knowledge of $N^+$.

5. EXPERIMENTS

We implemented bounded-space priority sampling with and without replacement (BPSWR/BPSWOR), priority sampling without replacement (PSWOR), Bernoulli sampling and the stratified sampling schemes in Java 1.6. The experiments have been run on a workstation PC with a 3 GHz Intel Pentium 4 processor and 2.5 GB main memory.

Almost all of the experiments have been run on real-world datasets because we felt that synthetic datasets cannot capture the complex distribution of real-world arrival rates. We used two real datasets, which reflect two different types of data streams frequently found in practice. The NETWORK dataset, which contains network traffic data, has a very bursty arrival rate with high short-term peaks.
In contrast, the SEARCH dataset contains usage statistics of a search engine and the arrival rate changes slowly; it basically depends on the time of day. These two datasets allowed us to study the influence of the evolution of the arrival rates on the sampling process.

The NETWORK dataset has been collected by monitoring one of our web servers for a period of 1 month. The dataset contains 8,430,904 items, where each item represents a TCP packet and consists of a timestamp (8 bytes), a source IP and port (4 + 2 bytes), a destination IP and port (4 + 2 bytes) and the size of the user data (2 bytes). The SEARCH dataset has been collected in a period of 3 months and contains 36,389,565 items. Each item consists of a timestamp (8 bytes) and a user id (4 bytes).

For most of our experiments, we do not report the estimation error of a specific estimate derived from the sample but rather give the key characteristics that influence the estimation error of any potential estimate. This way, our results are independent of the actual values associated with the items in our datasets. In the case of uniform sampling, the key characteristic is the sample size. Two uniform samples of the same size are identical in distribution, no matter which scheme has been used to compute them. Larger samples inevitably lead to a smaller estimation error. For stratified sampling, the key characteristic is the variance of the stratum sizes. This variance is a direct measure of how close stratification is to equi-depth stratification. A smaller variance typically results in less estimation error.

5.1 Summary of Experimental Results

For uniform sampling, we found that:

- BPSWOR is the method of choice when the available memory is limited and the data stream rate is varying. It then produces larger samples than Bernoulli sampling or PSWOR. Also, BPSWOR is the only scheme that does not require a-priori information about the data stream and guarantees an upper bound on the memory consumption.
- The window size ratio of the current window to both the current and previous window has a significant impact on the sample size of BPSWOR.
A small ratio leads to smaller samples, while a large ratio results in larger samples. For a given ratio, the sample size has low variance and is skewed towards larger samples.

- BPSWOR is superior to BPSWR because it is significantly faster and samples without replacement.
- The window size estimate discussed in Section 3.5 has low relative error. The relative error decreases with an increasing sample size.

For stratified sampling, we found that:

- Merge-based stratification leads to significantly lower stratum size variances than equi-width stratification when the data stream is bursty. Both schemes have comparable performance when the data stream rate changes slowly.
- Merge-based stratification seems to be robust to errors in the arrival rate estimate. Results with estimated arrival rates are close to the theoretical optimum.
- When the number of strata is not too large ($\le 32$), the overhead of merge-based stratification is low.

5.2 Uniform Sampling, Synthetic Data

In a first experiment, we compared Bernoulli sampling, PSWOR and BPSWOR. Neither Bernoulli sampling nor PSWOR can guarantee an upper bound on the space consumption, and without a-priori knowledge of the stream it is not possible to parameterize them to only infrequently exceed the space bound. The goal of this experiment is to compare the sample size and space consumption of the three schemes under the assumption that such a parameterization is possible. For this purpose, we generated a synthetic data stream, where each item of the data stream consists of an 8-byte timestamp and 32 bytes of dummy data. To generate the timestamps, we modeled the arrival rate of the stream using a sine curve with a 24h-period, which takes values between 3,000 and 5,000 items per hour. We superimposed the probability density function (PDF) of a normal distribution with mean 24 and variance 0.5 on the sine curve; the PDF has been scaled so that it takes a maximum of 30,000 items per hour. This models real-world scenarios where the peak arrival rate (scaled PDF) is much higher than the average arrival rate (sine curve). We used the three sampling schemes to maintain a sample from a sliding window of 1 hour length; the window size over time is given in Figure 4a. We used a space budget of 32 kbytes; at most 819 items can be stored in 32 kbytes of space.
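The synthetic arrival-rate model and the space-budget arithmetic can be reproduced with a few lines. The sketch below is an approximation under stated assumptions: the function names `arrival_rate` and `max_items` are hypothetical, the phase of the sine curve is chosen arbitrarily (it is not specified above), and "superimposed" is interpreted as adding the scaled normal PDF to the sine curve.

```python
import math


def arrival_rate(hour):
    """Modeled arrival rate (items per hour) at the given time in hours."""
    # Sine curve with a 24h period oscillating between 3,000 and 5,000.
    base = 4000.0 + 1000.0 * math.sin(2.0 * math.pi * hour / 24.0)
    # Normal PDF with mean 24 and variance 0.5, rescaled to a maximum of 30,000.
    peak = 30000.0 * math.exp(-((hour - 24.0) ** 2) / (2.0 * 0.5))
    return base + peak


def max_items(budget_bytes, item_bytes):
    """Number of whole items that fit into the space budget."""
    return budget_bytes // item_bytes


# 8-byte timestamp plus 32 bytes of dummy data per item, 32 kbytes budget.
capacity = max_items(32 * 1024, 8 + 32)
```

With 40 bytes per item, the 32 kbytes budget holds 32768 // 40 = 819 items, matching the figure given in the text.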
For the sampling schemes, we used parameters $k_{BPSWOR} = 585$ (number of candidate/test items), $k_{PSWOR} = 113$ (sample size) and $q_{Bernoulli}$ = [value missing in source] (sampling rate). The latter two parameters have been chosen so that the expected space consumption at the peak arrival rate equals 32 kbytes; as discussed above, this parameterization is only possible because we know the behavior of the stream in advance. During the sampling process, we monitored both sample size and space consumption; the results are given in Figure 4b and 4c, respectively.

Bernoulli sampling. The size of the Bernoulli sample follows the size of the window: it fluctuates around 110 items in the average case but stays close to the 819 items at peak times. The space consumption of the sample is proportional to the sample size; a large fraction of the available space remains unused in the average case.

Priority sampling. PSWOR produces a constant sample size of 113 items. The space consumption has a logarithmic dependence on the size of the window because, in addition to the sample items, PSWOR also stores the replacement set and the priority of each item.

Bounded priority sampling. BPSWOR produces a sample size of 300 items in the average case and therefore has a much better space utilization than Bernoulli sampling and PSWOR. When the peak arrives, the sample size first grows above, then falls below the 300-item average. Afterwards it stabilizes again. By Theorem 3, the sample size depends on the ratio of the number of items in the current window to the number of items in both the current and previous window together. This fraction is roughly constant in the average case but varies with the arrival of the peak load. Interestingly, the scheme almost always uses the entire available memory to store the candidate items and the test items. The space consumption slightly decreases when the peak arrives. In this case, we store fewer than $k$ test items because, due to the increased arrival rate, candidate items are replaced by new items before their expiration and so do not become test items.
To summarize, each of the three schemes has a distinctive advantage: Bernoulli sampling does not have any memory overhead, PSWOR guarantees a fixed sample size and BPSWOR samples in bounded space. If the available memory is limited, BPSWOR is the method of choice because it produces larger sample sizes than Bernoulli sampling or PSWOR and does not require any a-priori knowledge about the data stream. For these reasons, we do not consider Bernoulli sampling and PSWOR for our real-world experiments.

5.3 Uniform Sampling, Real Data

Next, we ran BPSWR and BPSWOR on our real-world datasets with a window size of one hour. We monitored the sample size, elapsed time and the window-size estimate during the sampling process and recorded the respective values at every full hour. We did not record more frequently so as to minimize the correlation between the measurements. The experiment was repeated with space budgets ranging from 1 kbyte to 32 kbytes. For each space budget, the experiment was repeated 32 times.

Sample size. In Figure 4d, we report the distribution of the BPSWOR sample size for the NETWORK dataset; similar results were observed with BPSWR. We used a space budget of 32 kbytes, which corresponds to a value of $k = 862$. The figure shows a histogram of the relative frequencies for varying sample sizes. As can be seen, the sample size concentrates around the average of 448 items and varies in the range from 11 to 862 items. The standard deviation of the sample size is 173 and in 95% of the cases, the sample size was larger than 176 items. By Theorem 3, the sample size depends on the ratio of the size of the current window to the size of both the prior and the current window, or the window size ratio for short. In Figure 4e, we give a histogram of the window size ratios in the NETWORK dataset. As can be seen, the distribution of the window size ratio has a striking similarity to the distribution of the sample size. To further investigate this issue, we give a box-and-whisker plot of the sample size for varying ranges of window size ratios in Figure 4f.
In a box-and-whisker plot, a box ranging from the first quartile to the third quartile of the distribution is drawn around the median value. From the box, whiskers extend to the minimum and maximum values as long as these values lie within 1.5 times the interquartile distance (= height of the box); the remaining values are treated as outliers and are directly added to the plot. From the figure, it becomes evident that the window size ratio has a significant influence on the sample size. Also, for each window size ratio, the sample size has low variance and is skewed towards larger samples. The skew results from the fact that the worst-case assumption of Theorem 3 does not always hold in practice; if it does not hold, the sample size is larger.

In Figures 4g, 4h and 4i, we give the corresponding results for the SEARCH dataset. Since the items in the SEARCH dataset require less space than the NETWORK items, a larger value of $k = 1170$ was chosen. As can be seen in the figure, the sample size distribution is much tighter because the arrival rate in the dataset does not vary as rapidly. The sample size ranges from 0 items to 1170 items, where a value of 0 has only been observed when the window was actually empty. The sample size averages 579 items and is larger than 447 items in 95% of the cases.

Performance. In Figure 4j, we compare the performance of BPSWR and BPSWOR for various space budgets on the NETWORK dataset. The figure shows the average time in milliseconds required to process a single item. It has logarithmic axes. For both algorithms, the per-item processing time increases with an increasing space budget, but BPSWOR is significantly more efficient than BPSWR. The results verify the theoretical analysis in Section 3.4. Since BPSWOR additionally samples without replacement, it is clearly superior to BPSWR.

Estimation of window size. In a final experiment with uniform sampling, we evaluated the accuracy and precision of the window size estimator given in Section 3.5 in terms of its relative error; the relative error of an estimate $\hat N$ of $N$ is defined as $|\hat N - N| / N$. Figure 4k and 4l display the distribution of the relative error for the NETWORK and SEARCH dataset, respectively, in a kernel-density plot.
The relative error is given for memory budgets of 32 kbytes, 64 kbytes and 128 kbytes for the entire sample; only the priorities are actually used for window size estimation. For both datasets and all sample sizes, the relative error almost always lies below 10% and often is much lower. As the memory budget and thus the value of $k$ increases, the estimation error decreases; see [3] for a detailed discussion of this behavior. We conclude that our window size estimator produces low-error estimates and can be used when synopses specialized on window size estimation are unavailable.

5.4 Stratified Sampling

In the next set of experiments, we compared equi-width stratification with merge-based stratification (MBS). Recall that during the sampling process, MBS occasionally requires an estimate of the number of items that arrive until the expiration of the first stratum. To quantify the impact of estimation, we considered two versions of MBS in our experiments. MBS-N makes use of an "oracle": whenever an estimate of the number of arriving items is required, we determine the exact number directly from the dataset so that no estimation error occurs. MBS-N can therefore be seen as the theoretical optimum of merge-based stratification. In contrast, MBS-$\hat N$ uses the estimation technique and robustness modifications as described in Section 4.4. The experimental setup is identical to the one used for uniform sampling, that is, we sample from the real-world datasets over a sliding window of 1 hour length. Unless stated otherwise, we used a space budget of 32 kbytes and $l = 32$ strata.

Variance of stratum sizes. We first compared the variance of the stratum sizes. In order to facilitate a meaningful variance comparison for windows of varying size, we report the coefficient of variation (CV) instead of the stratum size variance directly. The CV is defined as the standard deviation (square root of variance) normalized by the mean stratum size; a value less than 1 indicates a low-variance distribution, whereas a value larger than 1 is often considered high variance.
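The coefficient of variation used as the quality measure can be computed from the stratum sizes alone. The following sketch uses the population variance of the stratum sizes as in equation (3); the function name `coefficient_of_variation` is hypothetical.

```python
import math


def coefficient_of_variation(sizes):
    """Standard deviation of the stratum sizes divided by their mean."""
    mean = sum(sizes) / len(sizes)
    if mean == 0:
        raise ValueError("CV is undefined for an empty window")
    # Population variance of the stratum sizes, as in equation (3).
    variance = sum((s - mean) ** 2 for s in sizes) / len(sizes)
    return math.sqrt(variance) / mean


cv_balanced = coefficient_of_variation([4, 4, 4, 4])  # perfectly equi-depth
cv_skewed = coefficient_of_variation([0, 0, 0, 16])   # all items in one stratum
```

A perfectly balanced stratification yields a CV of 0, while placing all items of a four-stratum window into a single stratum yields a CV above 1.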
Figure 4m displays the distribution of the CV for the NETWORK dataset using a kernel-density plot. As can be seen, equi-width stratification leads to high values of the CV, while merge-based stratification produces significantly better results. Also, MBS-N and MBS-$\hat N$ perform similarly, with MBS-N being slightly superior. The difference between equi-width stratification and the MBS schemes is attributed to the burstiness of the NETWORK stream, in which the arrival rates vary significantly during a window length. In contrast, Figure 4n shows the distribution of the CV for the SEARCH dataset. Since the arrival rates change only slowly, equi-width stratification already produces very good results and the merge-based schemes essentially never decide to merge two adjacent strata. The three schemes produce almost identical results. Therefore, the burstier the data stream is, the more beneficial merge-based stratification becomes.

Accuracy of estimate (example). In a next experiment, we used the stratified sampling schemes to estimate the throughput of the NETWORK data from the sample. Here, we defined the throughput as the sum of the user-data size attribute over the entire window (see the CQL query given in the introduction). Figure 4o gives the distribution of the relative error of the estimate. The estimates derived from the merge-based schemes have a significantly lower estimation error than the estimates achieved with equi-width stratification. Thus, intelligent stratification indeed improves the quality of the sample. Note that for the SEARCH dataset, the distribution of the relative error would be almost indistinguishable for the three schemes because for this dataset, merge-based stratification does not improve upon equi-width stratification.

Number of strata (example). The number $l$ of strata can have a significant influence on the quality of the estimates. In Table 1, we give the average of the relative error (ARE) of the NETWORK throughput estimate for a varying number of strata. With an increasing number of strata, the ARE increases for equi-width stratification but decreases for the merge-based schemes.
On the one hand, the sample size per stratum decreases as $l$ increases and it becomes more and more important to distribute the strata evenly across the window. In fact, when the number of strata was high, equi-width stratification frequently produced empty strata and thereby wasted some of the available space. On the other hand, a large number of strata better exploits the correlations between time and the attribute of interest. Thus, the estimation error often decreases with an increasing value of $l$. In our experiment, the correlation of the user-data size attribute and time is low, so that the decrease in estimation error is also relatively low.

Table 1: Influence of the number of strata (NETWORK). Columns correspond to an increasing number of strata (left to right); the column labels and the Time (µs) values did not survive extraction.

ARE        Equi-width   2.31%   2.73%   3.44%   4.42%   5.90%
           MBS-N        2.00%   1.83%   1.74%   1.70%   1.72%
           MBS-N̂        2.04%   1.88%   1.82%   1.76%   1.79%
Time (µs)  Equi-width
           MBS-N
           MBS-N̂

Performance. In a final experiment, we measured the average per-item processing time for the three schemes and a varying number of strata. The results for the NETWORK data are given in Table 1. Clearly, equi-width stratification is the most efficient technique and the processing time does not depend upon the number of strata. The MBS schemes are slower because they occasionally have to 1) estimate the number of arriving items, 2) determine the optimum stratification and 3) merge adjacent strata. The computational effort increases as the number of strata increases. MBS-N is slightly faster than MBS-$\hat N$ because MBS-$\hat N$ reevaluates 2) if the stream behaves differently than predicted. In comparison to equi-width stratification, MBS leads to a significant performance overhead if the number of strata is large. However, when the number of strata is not too large ($l \le 32$), the overhead is low, while the quality of the resulting stratification might increase significantly.

6. CONCLUSION

We have studied bounded-space techniques for maintaining uniform and stratified samples over a time-based sliding window of a data stream. For uniform sampling, we have shown that any bounded-space sampling scheme that guarantees a lower bound on the sample size requires expected space logarithmic in the number of items in the window; the worst-case space consumption is at least as large. Our provably correct BPS scheme is the first bounded-space sampling scheme for time-based sliding windows. We have shown how BPS can be extended to efficiently sample without replacement and developed a low-variance estimator for the number of items in the window. The sample size produced by BPS is stable in general, but quick changes of the arrival rate might lead to temporarily smaller or larger samples. For stratified sampling, we have shown how the sample can be distributed evenly across the window by merging adjacent strata from time to time.
The decision of when and which sraa o merge is based on a dynamic programming algorihm, which uses an esimae of he arrival rae o deermine he bes achievable sraum boundaries. MBS is robus agains esimaion errors and produces significanly more balanced samples han equi-widh sraificaion. We found ha he overhead of MBS is small as long as he number of sraa is no oo large. Especially for bursy daa sreams, he increased precision of he esimaes derived from he sample compensaes for he overhead in compuaional cos. 7. REFERENCES [1] Charu C. Aggarwal. On biased reservoir sampling in he presence of sream evoluion. In Proc. VLDB, pages , [2] Brian Babcock, Mayur Daar, and Rajeev Mowani. Sampling from a moving window over sreaming daa. In Proc. SODA, pages , [3] Kevin Beyer, Peer J. Haas, Berhold Reinwald, Yannis Sismanis, and Rainer Gemulla. On synopses for disinc-value esimaion under mulise operaions. In Proc. SIGMOD, pages , [4] Paul G. Brown and Peer J. Haas. Techniques for warehousing of sample daa. In Proc. ICDE, [5] Moses Charikar, Suraji Chaudhuri, Rajeev Mowani, and Vivek Narasayya. Towards esimaion error guaranees for disinc values. In Proc. PODS, pages , [6] Mayur Daar, Arisides Gionis, Pior Indyk, and Rajeev Mowani. Mainaining sream saisics over sliding windows. SIAM J. Compu., 31(6): , [7] Rainer Gemulla, Wolfgang Lehner, and Peer J. Haas. Mainaining bounded-size sample synopses of evolving daases. The VLDB Journal, 17(2): , [8] Phillip B. Gibbons, Yossi Maias, and Viswanah Poosala. Fas incremenal mainenance of approximae hisograms. In Proc. VLDB, pages , [9] Peer J. Haas. Daa sream sampling: Basic echniques and resuls. In Daa Sream Managemen: Processing High Speed Daa Sreams. Springer, [10] H. V. Jagadish, Nick Koudas, S. Muhukrishnan, Viswanah Poosala, Kenneh C. Sevcik, and Torsen Suel. Opimal hisograms wih qualiy guaranees. In Proc. VLDB, pages , [11] Suman Nah, Phillip B. Gibbons, Srinivasan Seshan, and Zachary R. Anderson. 
Synopsis diffusion for robust aggregation in sensor networks. In Proc. SenSys.
[12] Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. Model Assisted Survey Sampling. Springer Series in Statistics. Springer.
[13] Raimund Seidel and Cecilia R. Aragon. Randomized search trees. Algorithmica, 16(4/5).
[14] Jeffrey Scott Vitter. Faster methods for random sampling. Commun. ACM, 27(7).
[15] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM TOMS, 11(1):37–57.

Repeatability Assessment Result

All the results in this paper were verified by the SIGMOD repeatability committee. Code and/or data used in the paper are available at http:// sigmod2008/.

Figure 4: Experimental results (see subheadings on the left-hand side)
Uniform, synthetic: (a) Window size; (b) Sample size; (c) Space
Uniform, NETWORK: (d) Sample size; (e) Window size ratio; (f) Both
Uniform, SEARCH: (g) Sample size; (h) Window size ratio; (i) Both
Uniform, real data: (j) Time (NETWORK); (k) Size est. (NETWORK); (l) Size est. (SEARCH)
Stratified, real data: (m) Stratum size variance (NETWORK); (n) Stratum size variance (SEARCH); (o) Throughput estimation (NETWORK)
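
The stratum-size panels of Figure 4 are expressed in terms of the coefficient of variation of the stratum sizes. As a rough illustration of why merging adjacent strata can balance a stratified sample better than equi-width stratification, the following Python sketch compares the two on a small, made-up bursty stream. Everything in it is an assumption made for illustration: the hourly counts, the function names, and in particular the greedy merge rule, which is a simple stand-in and not the dynamic programming algorithm that MBS uses.

```python
from statistics import mean, pstdev


def coefficient_of_variation(sizes):
    """Population standard deviation of the stratum sizes divided by their mean."""
    return pstdev(sizes) / mean(sizes)


def equi_width_sizes(timestamps, start, end, num_strata):
    """Count the items per stratum when [start, end) is cut into equal-length intervals."""
    width = (end - start) / num_strata
    sizes = [0] * num_strata
    for t in timestamps:
        idx = min(int((t - start) / width), num_strata - 1)
        sizes[idx] += 1
    return sizes


def merge_adjacent(sizes, num_strata):
    """Greedy stand-in for merging: repeatedly merge the adjacent pair
    of strata with the smallest combined size until num_strata remain."""
    sizes = list(sizes)
    while len(sizes) > num_strata:
        i = min(range(len(sizes) - 1), key=lambda j: sizes[j] + sizes[j + 1])
        sizes[i:i + 2] = [sizes[i] + sizes[i + 1]]
    return sizes


# Hypothetical 8-hour window: a quiet period followed by a two-hour burst.
items_per_hour = [10, 10, 10, 10, 10, 10, 100, 100]
timestamps = [h + k / c for h, c in enumerate(items_per_hour) for k in range(c)]

coarse = equi_width_sizes(timestamps, 0, 8, 4)   # four equi-width strata
fine = equi_width_sizes(timestamps, 0, 8, 8)     # eight fine-grained strata
merged = merge_adjacent(fine, 4)                 # merged down to four strata

print("equi-width:", coarse, round(coefficient_of_variation(coarse), 3))
print("merged:    ", merged, round(coefficient_of_variation(merged), 3))
```

With equi-width boundaries the burst ends up in a single stratum, so a fixed per-stratum sample underrepresents the burst and overrepresents the quiet period. Merging the small adjacent strata instead leaves the burst split across two strata, which lowers the coefficient of variation for the same number of strata.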

Multiprocessor Systems-on-Chips

Multiprocessor Systems-on-Chips Par of: Muliprocessor Sysems-on-Chips Edied by: Ahmed Amine Jerraya and Wayne Wolf Morgan Kaufmann Publishers, 2005 2 Modeling Shared Resources Conex swiching implies overhead. On a processing elemen,

More information

Task is a schedulable entity, i.e., a thread

Task is a schedulable entity, i.e., a thread Real-Time Scheduling Sysem Model Task is a schedulable eniy, i.e., a hread Time consrains of periodic ask T: - s: saring poin - e: processing ime of T - d: deadline of T - p: period of T Periodic ask T

More information

PROFIT TEST MODELLING IN LIFE ASSURANCE USING SPREADSHEETS PART ONE

PROFIT TEST MODELLING IN LIFE ASSURANCE USING SPREADSHEETS PART ONE Profi Tes Modelling in Life Assurance Using Spreadshees PROFIT TEST MODELLING IN LIFE ASSURANCE USING SPREADSHEETS PART ONE Erik Alm Peer Millingon 2004 Profi Tes Modelling in Life Assurance Using Spreadshees

More information

Single-machine Scheduling with Periodic Maintenance and both Preemptive and. Non-preemptive jobs in Remanufacturing System 1

Single-machine Scheduling with Periodic Maintenance and both Preemptive and. Non-preemptive jobs in Remanufacturing System 1 Absrac number: 05-0407 Single-machine Scheduling wih Periodic Mainenance and boh Preempive and Non-preempive jobs in Remanufacuring Sysem Liu Biyu hen Weida (School of Economics and Managemen Souheas Universiy

More information

Chapter 1.6 Financial Management

Chapter 1.6 Financial Management Chaper 1.6 Financial Managemen Par I: Objecive ype quesions and answers 1. Simple pay back period is equal o: a) Raio of Firs cos/ne yearly savings b) Raio of Annual gross cash flow/capial cos n c) = (1

More information

Real-time Particle Filters

Real-time Particle Filters Real-ime Paricle Filers Cody Kwok Dieer Fox Marina Meilă Dep. of Compuer Science & Engineering, Dep. of Saisics Universiy of Washingon Seale, WA 9895 ckwok,fox @cs.washingon.edu, [email protected] Absrac

More information

Making a Faster Cryptanalytic Time-Memory Trade-Off

Making a Faster Cryptanalytic Time-Memory Trade-Off Making a Faser Crypanalyic Time-Memory Trade-Off Philippe Oechslin Laboraoire de Securié e de Crypographie (LASEC) Ecole Polyechnique Fédérale de Lausanne Faculé I&C, 1015 Lausanne, Swizerland [email protected]

More information

Measuring macroeconomic volatility Applications to export revenue data, 1970-2005

Measuring macroeconomic volatility Applications to export revenue data, 1970-2005 FONDATION POUR LES ETUDES ET RERS LE DEVELOPPEMENT INTERNATIONAL Measuring macroeconomic volailiy Applicaions o expor revenue daa, 1970-005 by Joël Cariolle Policy brief no. 47 March 01 The FERDI is a

More information

MACROECONOMIC FORECASTS AT THE MOF A LOOK INTO THE REAR VIEW MIRROR

MACROECONOMIC FORECASTS AT THE MOF A LOOK INTO THE REAR VIEW MIRROR MACROECONOMIC FORECASTS AT THE MOF A LOOK INTO THE REAR VIEW MIRROR The firs experimenal publicaion, which summarised pas and expeced fuure developmen of basic economic indicaors, was published by he Minisry

More information

SELF-EVALUATION FOR VIDEO TRACKING SYSTEMS

SELF-EVALUATION FOR VIDEO TRACKING SYSTEMS SELF-EVALUATION FOR VIDEO TRACKING SYSTEMS Hao Wu and Qinfen Zheng Cenre for Auomaion Research Dep. of Elecrical and Compuer Engineering Universiy of Maryland, College Park, MD-20742 {wh2003, qinfen}@cfar.umd.edu

More information

INTEREST RATE FUTURES AND THEIR OPTIONS: SOME PRICING APPROACHES

INTEREST RATE FUTURES AND THEIR OPTIONS: SOME PRICING APPROACHES INTEREST RATE FUTURES AND THEIR OPTIONS: SOME PRICING APPROACHES OPENGAMMA QUANTITATIVE RESEARCH Absrac. Exchange-raded ineres rae fuures and heir opions are described. The fuure opions include hose paying

More information

Individual Health Insurance April 30, 2008 Pages 167-170

Individual Health Insurance April 30, 2008 Pages 167-170 Individual Healh Insurance April 30, 2008 Pages 167-170 We have received feedback ha his secion of he e is confusing because some of he defined noaion is inconsisen wih comparable life insurance reserve

More information

Performance Center Overview. Performance Center Overview 1

Performance Center Overview. Performance Center Overview 1 Performance Cener Overview Performance Cener Overview 1 ODJFS Performance Cener ce Cener New Performance Cener Model Performance Cener Projec Meeings Performance Cener Execuive Meeings Performance Cener

More information

A Note on Using the Svensson procedure to estimate the risk free rate in corporate valuation

A Note on Using the Svensson procedure to estimate the risk free rate in corporate valuation A Noe on Using he Svensson procedure o esimae he risk free rae in corporae valuaion By Sven Arnold, Alexander Lahmann and Bernhard Schwezler Ocober 2011 1. The risk free ineres rae in corporae valuaion

More information

Mathematics in Pharmacokinetics What and Why (A second attempt to make it clearer)

Mathematics in Pharmacokinetics What and Why (A second attempt to make it clearer) Mahemaics in Pharmacokineics Wha and Why (A second aemp o make i clearer) We have used equaions for concenraion () as a funcion of ime (). We will coninue o use hese equaions since he plasma concenraions

More information

CHARGE AND DISCHARGE OF A CAPACITOR

CHARGE AND DISCHARGE OF A CAPACITOR REFERENCES RC Circuis: Elecrical Insrumens: Mos Inroducory Physics exs (e.g. A. Halliday and Resnick, Physics ; M. Sernheim and J. Kane, General Physics.) This Laboraory Manual: Commonly Used Insrumens:

More information

Why Did the Demand for Cash Decrease Recently in Korea?

Why Did the Demand for Cash Decrease Recently in Korea? Why Did he Demand for Cash Decrease Recenly in Korea? Byoung Hark Yoo Bank of Korea 26. 5 Absrac We explores why cash demand have decreased recenly in Korea. The raio of cash o consumpion fell o 4.7% in

More information

Morningstar Investor Return

Morningstar Investor Return Morningsar Invesor Reurn Morningsar Mehodology Paper Augus 31, 2010 2010 Morningsar, Inc. All righs reserved. The informaion in his documen is he propery of Morningsar, Inc. Reproducion or ranscripion

More information

The Transport Equation

The Transport Equation The Transpor Equaion Consider a fluid, flowing wih velociy, V, in a hin sraigh ube whose cross secion will be denoed by A. Suppose he fluid conains a conaminan whose concenraion a posiion a ime will be

More information

The Application of Multi Shifts and Break Windows in Employees Scheduling

The Application of Multi Shifts and Break Windows in Employees Scheduling The Applicaion of Muli Shifs and Brea Windows in Employees Scheduling Evy Herowai Indusrial Engineering Deparmen, Universiy of Surabaya, Indonesia Absrac. One mehod for increasing company s performance

More information

MTH6121 Introduction to Mathematical Finance Lesson 5

MTH6121 Introduction to Mathematical Finance Lesson 5 26 MTH6121 Inroducion o Mahemaical Finance Lesson 5 Conens 2.3 Brownian moion wih drif........................... 27 2.4 Geomeric Brownian moion........................... 28 2.5 Convergence of random

More information

Duration and Convexity ( ) 20 = Bond B has a maturity of 5 years and also has a required rate of return of 10%. Its price is $613.

Duration and Convexity ( ) 20 = Bond B has a maturity of 5 years and also has a required rate of return of 10%. Its price is $613. Graduae School of Business Adminisraion Universiy of Virginia UVA-F-38 Duraion and Convexiy he price of a bond is a funcion of he promised paymens and he marke required rae of reurn. Since he promised

More information

ANALYSIS AND COMPARISONS OF SOME SOLUTION CONCEPTS FOR STOCHASTIC PROGRAMMING PROBLEMS

ANALYSIS AND COMPARISONS OF SOME SOLUTION CONCEPTS FOR STOCHASTIC PROGRAMMING PROBLEMS ANALYSIS AND COMPARISONS OF SOME SOLUTION CONCEPTS FOR STOCHASTIC PROGRAMMING PROBLEMS R. Caballero, E. Cerdá, M. M. Muñoz and L. Rey () Deparmen of Applied Economics (Mahemaics), Universiy of Málaga,

More information

Automatic measurement and detection of GSM interferences

Automatic measurement and detection of GSM interferences Auomaic measuremen and deecion of GSM inerferences Poor speech qualiy and dropped calls in GSM neworks may be caused by inerferences as a resul of high raffic load. The radio nework analyzers from Rohde

More information

Journal Of Business & Economics Research September 2005 Volume 3, Number 9

Journal Of Business & Economics Research September 2005 Volume 3, Number 9 Opion Pricing And Mone Carlo Simulaions George M. Jabbour, (Email: [email protected]), George Washingon Universiy Yi-Kang Liu, ([email protected]), George Washingon Universiy ABSTRACT The advanage of Mone Carlo

More information

PATHWISE PROPERTIES AND PERFORMANCE BOUNDS FOR A PERISHABLE INVENTORY SYSTEM

PATHWISE PROPERTIES AND PERFORMANCE BOUNDS FOR A PERISHABLE INVENTORY SYSTEM PATHWISE PROPERTIES AND PERFORMANCE BOUNDS FOR A PERISHABLE INVENTORY SYSTEM WILLIAM L. COOPER Deparmen of Mechanical Engineering, Universiy of Minnesoa, 111 Church Sree S.E., Minneapolis, MN 55455 [email protected]

More information

BALANCE OF PAYMENTS. First quarter 2008. Balance of payments

BALANCE OF PAYMENTS. First quarter 2008. Balance of payments BALANCE OF PAYMENTS DATE: 2008-05-30 PUBLISHER: Balance of Paymens and Financial Markes (BFM) Lena Finn + 46 8 506 944 09, [email protected] Camilla Bergeling +46 8 506 942 06, [email protected]

More information

Chapter 8: Regression with Lagged Explanatory Variables

Chapter 8: Regression with Lagged Explanatory Variables Chaper 8: Regression wih Lagged Explanaory Variables Time series daa: Y for =1,..,T End goal: Regression model relaing a dependen variable o explanaory variables. Wih ime series new issues arise: 1. One

More information

Distributing Human Resources among Software Development Projects 1

Distributing Human Resources among Software Development Projects 1 Disribuing Human Resources among Sofware Developmen Proecs Macario Polo, María Dolores Maeos, Mario Piaini and rancisco Ruiz Summary This paper presens a mehod for esimaing he disribuion of human resources

More information

TSG-RAN Working Group 1 (Radio Layer 1) meeting #3 Nynashamn, Sweden 22 nd 26 th March 1999

TSG-RAN Working Group 1 (Radio Layer 1) meeting #3 Nynashamn, Sweden 22 nd 26 th March 1999 TSG-RAN Working Group 1 (Radio Layer 1) meeing #3 Nynashamn, Sweden 22 nd 26 h March 1999 RAN TSGW1#3(99)196 Agenda Iem: 9.1 Source: Tile: Documen for: Moorola Macro-diversiy for he PRACH Discussion/Decision

More information

Option Put-Call Parity Relations When the Underlying Security Pays Dividends

Option Put-Call Parity Relations When the Underlying Security Pays Dividends Inernaional Journal of Business and conomics, 26, Vol. 5, No. 3, 225-23 Opion Pu-all Pariy Relaions When he Underlying Securiy Pays Dividends Weiyu Guo Deparmen of Finance, Universiy of Nebraska Omaha,

More information

11/6/2013. Chapter 14: Dynamic AD-AS. Introduction. Introduction. Keeping track of time. The model s elements

11/6/2013. Chapter 14: Dynamic AD-AS. Introduction. Introduction. Keeping track of time. The model s elements Inroducion Chaper 14: Dynamic D-S dynamic model of aggregae and aggregae supply gives us more insigh ino how he economy works in he shor run. I is a simplified version of a DSGE model, used in cuing-edge

More information

Risk Modelling of Collateralised Lending

Risk Modelling of Collateralised Lending Risk Modelling of Collaeralised Lending Dae: 4-11-2008 Number: 8/18 Inroducion This noe explains how i is possible o handle collaeralised lending wihin Risk Conroller. The approach draws on he faciliies

More information

Principal components of stock market dynamics. Methodology and applications in brief (to be updated ) Andrei Bouzaev, bouzaev@ya.

Principal components of stock market dynamics. Methodology and applications in brief (to be updated ) Andrei Bouzaev, bouzaev@ya. Principal componens of sock marke dynamics Mehodology and applicaions in brief o be updaed Andrei Bouzaev, [email protected] Why principal componens are needed Objecives undersand he evidence of more han one

More information

INTRODUCTION TO FORECASTING

INTRODUCTION TO FORECASTING INTRODUCTION TO FORECASTING INTRODUCTION: Wha is a forecas? Why do managers need o forecas? A forecas is an esimae of uncerain fuure evens (lierally, o "cas forward" by exrapolaing from pas and curren

More information

I. Basic Concepts (Ch. 1-4)

I. Basic Concepts (Ch. 1-4) (Ch. 1-4) A. Real vs. Financial Asses (Ch 1.2) Real asses (buildings, machinery, ec.) appear on he asse side of he balance shee. Financial asses (bonds, socks) appear on boh sides of he balance shee. Creaing

More information

TEMPORAL PATTERN IDENTIFICATION OF TIME SERIES DATA USING PATTERN WAVELETS AND GENETIC ALGORITHMS

TEMPORAL PATTERN IDENTIFICATION OF TIME SERIES DATA USING PATTERN WAVELETS AND GENETIC ALGORITHMS TEMPORAL PATTERN IDENTIFICATION OF TIME SERIES DATA USING PATTERN WAVELETS AND GENETIC ALGORITHMS RICHARD J. POVINELLI AND XIN FENG Deparmen of Elecrical and Compuer Engineering Marquee Universiy, P.O.

More information

USE OF EDUCATION TECHNOLOGY IN ENGLISH CLASSES

USE OF EDUCATION TECHNOLOGY IN ENGLISH CLASSES USE OF EDUCATION TECHNOLOGY IN ENGLISH CLASSES Mehme Nuri GÖMLEKSİZ Absrac Using educaion echnology in classes helps eachers realize a beer and more effecive learning. In his sudy 150 English eachers were

More information

Supplementary Appendix for Depression Babies: Do Macroeconomic Experiences Affect Risk-Taking?

Supplementary Appendix for Depression Babies: Do Macroeconomic Experiences Affect Risk-Taking? Supplemenary Appendix for Depression Babies: Do Macroeconomic Experiences Affec Risk-Taking? Ulrike Malmendier UC Berkeley and NBER Sefan Nagel Sanford Universiy and NBER Sepember 2009 A. Deails on SCF

More information

Analysis of Pricing and Efficiency Control Strategy between Internet Retailer and Conventional Retailer

Analysis of Pricing and Efficiency Control Strategy between Internet Retailer and Conventional Retailer Recen Advances in Business Managemen and Markeing Analysis of Pricing and Efficiency Conrol Sraegy beween Inerne Reailer and Convenional Reailer HYUG RAE CHO 1, SUG MOO BAE and JOG HU PARK 3 Deparmen of

More information

Hedging with Forwards and Futures

Hedging with Forwards and Futures Hedging wih orwards and uures Hedging in mos cases is sraighforward. You plan o buy 10,000 barrels of oil in six monhs and you wish o eliminae he price risk. If you ake he buy-side of a forward/fuures

More information

SPEC model selection algorithm for ARCH models: an options pricing evaluation framework

SPEC model selection algorithm for ARCH models: an options pricing evaluation framework Applied Financial Economics Leers, 2008, 4, 419 423 SEC model selecion algorihm for ARCH models: an opions pricing evaluaion framework Savros Degiannakis a, * and Evdokia Xekalaki a,b a Deparmen of Saisics,

More information

Chapter 6: Business Valuation (Income Approach)

Chapter 6: Business Valuation (Income Approach) Chaper 6: Business Valuaion (Income Approach) Cash flow deerminaion is one of he mos criical elemens o a business valuaion. Everyhing may be secondary. If cash flow is high, hen he value is high; if he

More information

How To Predict A Person'S Behavior

How To Predict A Person'S Behavior Informaion Theoreic Approaches for Predicive Models: Resuls and Analysis Monica Dinculescu Supervised by Doina Precup Absrac Learning he inernal represenaion of parially observable environmens has proven

More information

Dynamic programming models and algorithms for the mutual fund cash balance problem

Dynamic programming models and algorithms for the mutual fund cash balance problem Submied o Managemen Science manuscrip Dynamic programming models and algorihms for he muual fund cash balance problem Juliana Nascimeno Deparmen of Operaions Research and Financial Engineering, Princeon

More information

Stock Trading with Recurrent Reinforcement Learning (RRL) CS229 Application Project Gabriel Molina, SUID 5055783

Stock Trading with Recurrent Reinforcement Learning (RRL) CS229 Application Project Gabriel Molina, SUID 5055783 Sock raing wih Recurren Reinforcemen Learning (RRL) CS9 Applicaion Projec Gabriel Molina, SUID 555783 I. INRODUCION One relaively new approach o financial raing is o use machine learning algorihms o preic

More information

Present Value Methodology

Present Value Methodology Presen Value Mehodology Econ 422 Invesmen, Capial & Finance Universiy of Washingon Eric Zivo Las updaed: April 11, 2010 Presen Value Concep Wealh in Fisher Model: W = Y 0 + Y 1 /(1+r) The consumer/producer

More information

The option pricing framework

The option pricing framework Chaper 2 The opion pricing framework The opion markes based on swap raes or he LIBOR have become he larges fixed income markes, and caps (floors) and swapions are he mos imporan derivaives wihin hese markes.

More information

Chapter 7. Response of First-Order RL and RC Circuits

Chapter 7. Response of First-Order RL and RC Circuits Chaper 7. esponse of Firs-Order L and C Circuis 7.1. The Naural esponse of an L Circui 7.2. The Naural esponse of an C Circui 7.3. The ep esponse of L and C Circuis 7.4. A General oluion for ep and Naural

More information

Credit Index Options: the no-armageddon pricing measure and the role of correlation after the subprime crisis

Credit Index Options: the no-armageddon pricing measure and the role of correlation after the subprime crisis Second Conference on The Mahemaics of Credi Risk, Princeon May 23-24, 2008 Credi Index Opions: he no-armageddon pricing measure and he role of correlaion afer he subprime crisis Damiano Brigo - Join work

More information

II.1. Debt reduction and fiscal multipliers. dbt da dpbal da dg. bal

II.1. Debt reduction and fiscal multipliers. dbt da dpbal da dg. bal Quarerly Repor on he Euro Area 3/202 II.. Deb reducion and fiscal mulipliers The deerioraion of public finances in he firs years of he crisis has led mos Member Saes o adop sizeable consolidaion packages.

More information

Time Series Analysis Using SAS R Part I The Augmented Dickey-Fuller (ADF) Test

Time Series Analysis Using SAS R Part I The Augmented Dickey-Fuller (ADF) Test ABSTRACT Time Series Analysis Using SAS R Par I The Augmened Dickey-Fuller (ADF) Tes By Ismail E. Mohamed The purpose of his series of aricles is o discuss SAS programming echniques specifically designed

More information

The Grantor Retained Annuity Trust (GRAT)

The Grantor Retained Annuity Trust (GRAT) WEALTH ADVISORY Esae Planning Sraegies for closely-held, family businesses The Granor Reained Annuiy Trus (GRAT) An efficien wealh ransfer sraegy, paricularly in a low ineres rae environmen Family business

More information

Does Option Trading Have a Pervasive Impact on Underlying Stock Prices? *

Does Option Trading Have a Pervasive Impact on Underlying Stock Prices? * Does Opion Trading Have a Pervasive Impac on Underlying Sock Prices? * Neil D. Pearson Universiy of Illinois a Urbana-Champaign Allen M. Poeshman Universiy of Illinois a Urbana-Champaign Joshua Whie Universiy

More information

Chapter 4: Exponential and Logarithmic Functions

Chapter 4: Exponential and Logarithmic Functions Chaper 4: Eponenial and Logarihmic Funcions Secion 4.1 Eponenial Funcions... 15 Secion 4. Graphs of Eponenial Funcions... 3 Secion 4.3 Logarihmic Funcions... 4 Secion 4.4 Logarihmic Properies... 53 Secion

More information

17 Laplace transform. Solving linear ODE with piecewise continuous right hand sides

17 Laplace transform. Solving linear ODE with piecewise continuous right hand sides 7 Laplace ransform. Solving linear ODE wih piecewise coninuous righ hand sides In his lecure I will show how o apply he Laplace ransform o he ODE Ly = f wih piecewise coninuous f. Definiion. A funcion

More information

Nikkei Stock Average Volatility Index Real-time Version Index Guidebook

Nikkei Stock Average Volatility Index Real-time Version Index Guidebook Nikkei Sock Average Volailiy Index Real-ime Version Index Guidebook Nikkei Inc. Wih he modificaion of he mehodology of he Nikkei Sock Average Volailiy Index as Nikkei Inc. (Nikkei) sars calculaing and

More information

Bayesian Filtering with Online Gaussian Process Latent Variable Models

Bayesian Filtering with Online Gaussian Process Latent Variable Models Bayesian Filering wih Online Gaussian Process Laen Variable Models Yali Wang Laval Universiy [email protected] Marcus A. Brubaker TTI Chicago [email protected] Brahim Chaib-draa Laval Universiy

More information

Random Walk in 1-D. 3 possible paths x vs n. -5 For our random walk, we assume the probabilities p,q do not depend on time (n) - stationary

Random Walk in 1-D. 3 possible paths x vs n. -5 For our random walk, we assume the probabilities p,q do not depend on time (n) - stationary Random Walk in -D Random walks appear in many cones: diffusion is a random walk process undersanding buffering, waiing imes, queuing more generally he heory of sochasic processes gambling choosing he bes

More information

DDoS Attacks Detection Model and its Application

DDoS Attacks Detection Model and its Application DDoS Aacks Deecion Model and is Applicaion 1, MUHAI LI, 1 MING LI, XIUYING JIANG 1 School of Informaion Science & Technology Eas China Normal Universiy No. 500, Dong-Chuan Road, Shanghai 0041, PR. China

More information

Task-Execution Scheduling Schemes for Network Measurement and Monitoring

Task-Execution Scheduling Schemes for Network Measurement and Monitoring Task-Execuion Scheduling Schemes for Nework Measuremen and Monioring Zhen Qin, Robero Rojas-Cessa, and Nirwan Ansari Deparmen of Elecrical and Compuer Engineering New Jersey Insiue of Technology Universiy

More information

Hotel Room Demand Forecasting via Observed Reservation Information

Hotel Room Demand Forecasting via Observed Reservation Information Proceedings of he Asia Pacific Indusrial Engineering & Managemen Sysems Conference 0 V. Kachivichyanuul, H.T. Luong, and R. Piaaso Eds. Hoel Room Demand Forecasing via Observed Reservaion Informaion aragain

More information

Markov Chain Modeling of Policy Holder Behavior in Life Insurance and Pension

Markov Chain Modeling of Policy Holder Behavior in Life Insurance and Pension Markov Chain Modeling of Policy Holder Behavior in Life Insurance and Pension Lars Frederik Brand Henriksen 1, Jeppe Woemann Nielsen 2, Mogens Seffensen 1, and Chrisian Svensson 2 1 Deparmen of Mahemaical

More information

Economics Honors Exam 2008 Solutions Question 5

Economics Honors Exam 2008 Solutions Question 5 Economics Honors Exam 2008 Soluions Quesion 5 (a) (2 poins) Oupu can be decomposed as Y = C + I + G. And we can solve for i by subsiuing in equaions given in he quesion, Y = C + I + G = c 0 + c Y D + I

More information

Market Liquidity and the Impacts of the Computerized Trading System: Evidence from the Stock Exchange of Thailand

Market Liquidity and the Impacts of the Computerized Trading System: Evidence from the Stock Exchange of Thailand 36 Invesmen Managemen and Financial Innovaions, 4/4 Marke Liquidiy and he Impacs of he Compuerized Trading Sysem: Evidence from he Sock Exchange of Thailand Sorasar Sukcharoensin 1, Pariyada Srisopisawa,

More information

THE FIRM'S INVESTMENT DECISION UNDER CERTAINTY: CAPITAL BUDGETING AND RANKING OF NEW INVESTMENT PROJECTS

THE FIRM'S INVESTMENT DECISION UNDER CERTAINTY: CAPITAL BUDGETING AND RANKING OF NEW INVESTMENT PROJECTS VII. THE FIRM'S INVESTMENT DECISION UNDER CERTAINTY: CAPITAL BUDGETING AND RANKING OF NEW INVESTMENT PROJECTS The mos imporan decisions for a firm's managemen are is invesmen decisions. While i is surely

More information

DYNAMIC MODELS FOR VALUATION OF WRONGFUL DEATH PAYMENTS

DYNAMIC MODELS FOR VALUATION OF WRONGFUL DEATH PAYMENTS DYNAMIC MODELS FOR VALUATION OF WRONGFUL DEATH PAYMENTS Hong Mao, Shanghai Second Polyechnic Universiy Krzyszof M. Osaszewski, Illinois Sae Universiy Youyu Zhang, Fudan Universiy ABSTRACT Liigaion, exper

More information

The naive method discussed in Lecture 1 uses the most recent observations to forecast future values. That is, Y ˆ t + 1

The naive method discussed in Lecture 1 uses the most recent observations to forecast future values. That is, Y ˆ t + 1 Business Condiions & Forecasing Exponenial Smoohing LECTURE 2 MOVING AVERAGES AND EXPONENTIAL SMOOTHING OVERVIEW This lecure inroduces ime-series smoohing forecasing mehods. Various models are discussed,

More information

Vector Autoregressions (VARs): Operational Perspectives

Vector Autoregressions (VARs): Operational Perspectives Vecor Auoregressions (VARs): Operaional Perspecives Primary Source: Sock, James H., and Mark W. Wason, Vecor Auoregressions, Journal of Economic Perspecives, Vol. 15 No. 4 (Fall 2001), 101-115. Macroeconomericians

More information

Optimal Investment and Consumption Decision of Family with Life Insurance

Optimal Investment and Consumption Decision of Family with Life Insurance Opimal Invesmen and Consumpion Decision of Family wih Life Insurance Minsuk Kwak 1 2 Yong Hyun Shin 3 U Jin Choi 4 6h World Congress of he Bachelier Finance Sociey Torono, Canada June 25, 2010 1 Speaker

More information

Tax Externalities of Equity Mutual Funds

Tax Externalities of Equity Mutual Funds Tax Exernaliies of Equiy Muual Funds Joel M. Dickson The Vanguard Group, Inc. John B. Shoven Sanford Universiy and NBER Clemens Sialm Sanford Universiy December 1999 Absrac: Invesors holding muual funds

More information

Chapter 8 Student Lecture Notes 8-1

Chapter 8 Student Lecture Notes 8-1 Chaper Suden Lecure Noes - Chaper Goals QM: Business Saisics Chaper Analyzing and Forecasing -Series Daa Afer compleing his chaper, you should be able o: Idenify he componens presen in a ime series Develop

More information

cooking trajectory boiling water B (t) microwave 0 2 4 6 8 101214161820 time t (mins)

cooking trajectory boiling water B (t) microwave 0 2 4 6 8 101214161820 time t (mins) Alligaor egg wih calculus We have a large alligaor egg jus ou of he fridge (1 ) which we need o hea o 9. Now here are wo accepable mehods for heaing alligaor eggs, one is o immerse hem in boiling waer

More information

Appendix A: Area. 1 Find the radius of a circle that has circumference 12 inches.

Appendix A: Area. 1 Find the radius of a circle that has circumference 12 inches. Appendi A: Area worked-ou s o Odd-Numbered Eercises Do no read hese worked-ou s before aemping o do he eercises ourself. Oherwise ou ma mimic he echniques shown here wihou undersanding he ideas. Bes wa

More information

Term Structure of Prices of Asian Options

Term Structure of Prices of Asian Options Term Srucure of Prices of Asian Opions Jirô Akahori, Tsuomu Mikami, Kenji Yasuomi and Teruo Yokoa Dep. of Mahemaical Sciences, Risumeikan Universiy 1-1-1 Nojihigashi, Kusasu, Shiga 525-8577, Japan E-mail:

More information

4. International Parity Conditions

4. International Parity Conditions 4. Inernaional ariy ondiions 4.1 urchasing ower ariy he urchasing ower ariy ( heory is one of he early heories of exchange rae deerminaion. his heory is based on he concep ha he demand for a counry's currency

More information

Fair Stateless Model Checking

Fair Stateless Model Checking Fair Saeless Model Checking Madanlal Musuvahi Shaz Qadeer Microsof Research {madanm,[email protected] Absrac Saeless model checking is a useful sae-space exploraion echnique for sysemaically esing complex

More information

Life insurance cash flows with policyholder behaviour

Life insurance cash flows with policyholder behaviour Life insurance cash flows wih policyholder behaviour Krisian Buchard,,1 & Thomas Møller, Deparmen of Mahemaical Sciences, Universiy of Copenhagen Universiesparken 5, DK-2100 Copenhagen Ø, Denmark PFA Pension,

More information

9. Capacitor and Resistor Circuits

9. Capacitor and Resistor Circuits ElecronicsLab9.nb 1 9. Capacior and Resisor Circuis Inroducion hus far we have consider resisors in various combinaions wih a power supply or baery which provide a consan volage source or direc curren

More information

The effect of demand distributions on the performance of inventory policies

The effect of demand distributions on the performance of inventory policies DOI 10.2195/LJ_Ref_Kuhn_en_200907 The effec of demand disribuions on he performance of invenory policies SONJA KUHNT & WIEBKE SIEBEN FAKULTÄT STATISTIK TECHNISCHE UNIVERSITÄT DORTMUND 44221 DORTMUND Invenory

More information

Default Risk in Equity Returns

Default Risk in Equity Returns Defaul Risk in Equiy Reurns MRI VSSLOU and YUHNG XING * BSTRCT This is he firs sudy ha uses Meron s (1974) opion pricing model o compue defaul measures for individual firms and assess he effec of defaul

More information

How To Calculate Price Elasiciy Per Capia Per Capi

How To Calculate Price Elasiciy Per Capia Per Capi Price elasiciy of demand for crude oil: esimaes for 23 counries John C.B. Cooper Absrac This paper uses a muliple regression model derived from an adapaion of Nerlove s parial adjusmen model o esimae boh

More information

STABILITY OF LOAD BALANCING ALGORITHMS IN DYNAMIC ADVERSARIAL SYSTEMS

STABILITY OF LOAD BALANCING ALGORITHMS IN DYNAMIC ADVERSARIAL SYSTEMS STABILITY OF LOAD BALANCING ALGORITHMS IN DYNAMIC ADVERSARIAL SYSTEMS ELLIOT ANSHELEVICH, DAVID KEMPE, AND JON KLEINBERG Absrac. In he dynamic load balancing problem, we seek o keep he job load roughly

More information

Forecasting, Ordering and Stock- Holding for Erratic Demand

Forecasting, Ordering and Stock- Holding for Erratic Demand ISF 2002 23 rd o 26 h June 2002 Forecasing, Ordering and Sock- Holding for Erraic Demand Andrew Eaves Lancaser Universiy / Andalus Soluions Limied Inroducion Erraic and slow-moving demand Demand classificaion

More information
