Sketchng Sampled Data Streams Florn Rusu, Aln Dobra CISE Department Unversty of Florda Ganesvlle, FL, USA frusu@cse.ufl.edu adobra@cse.ufl.edu Abstract Samplng s used as a unversal method to reduce the runnng tme of computatons the computaton s performed on a much smaller sample and then the result s scaled to compensate for the dfference n sze. Sketches are a popular approxmaton method for data streams and they proved to be useful for estmatng frequency moments and aggregates over jons. A possblty to further mprove the tme performance of sketches s to compute the sketch over a sample of the stream rather than the entre data stream. In ths paper we analyze the behavor of the sketch estmator when computed over a sample of the stream, not the entre data stream, for the sze of jon and the self-jon sze problems. Our analyss s developed for a generc samplng process. We nstantate the results of the analyss for all three major types of samplng Bernoull samplng whch s used for load sheddng, samplng wth replacement whch s used to generate..d. samples from a dstrbuton, and samplng wthout replacement whch s used by onlne aggregaton engnes and compare these partcular results wth the results of the basc sketch estmator. Our expermental results show that the accuracy of the sketch computed over a small sample of the data s, n general, close to the accuracy of the sketch estmator computed over the entre data even when the sample sze s only 0% or less of the dataset sze. Ths s equvalent to a speed-up factor of at least 0 when updatng the sketch. I. INTRODUCTION Data streamng has receved a lot of attenton from the research communty n the last decade. The requrement to process fast data streams motvates the need for approxmaton methods that make use of both small space and small tme. AGMS sketches, and ther mproved varant F-AGMS sketches 3, 4 proved to be a vable soluton for estmatng aggregates over jons. The man strengths of the sketchng technques are the smple and fast update procedure, the small memory requrement, and provable error guarantees. When the data streams that need to be processed are extremely fast, for example n the case of networkng data or large datasets streamed over the Internet, t s desrable to further reduce the update tme of sketches n order to acheve the requred processng rates. Samplng s a unversal method for data reducton and, n prncple, t can be used to reduce the amount of data that needs to be sketched. If samples are sketched nstead of the orgnal data, an mmedate update tme reducton results. Ths s smlar to the exstng load sheddng technques employed n data stream processng engnes 5. The man concern when samples rather than the orgnal data are sketched s how to extend the error guarantees sketches provde to ths new stuaton. The formulas resultng from such an analyss could be used to determne how aggressve the load sheddng can be wthout a sgnfcant loss n the accuracy of the sketch over samples estmator. A seemngly unrelated, but, as shown n the paper, techncally related, problem s analyzng streams of samples from unknown dstrbutons. Samples from unknown dstrbutons the so called..d. samples are the nput to most of the onlne data-mnng algorthms 6. In ths case the samples are not used as a data reducton technque, but rather they are the only nformaton avalable about the unknown dstrbuton. A fundamental problem n ths context s how to characterze the unknown dstrbuton usng only the samples. Ths s one of the fundamental problems n statstcs 7. If the samples are streamed, as s the case n onlne data-mnng, the am s to characterze the unknown dstrbuton by usng small space only, thus makng sketches a natural canddate for computatons that nvolve aggregates. It s a smple matter to use sketches n order to estmate aggregates over the samples. If predctons about the unknown dstrbuton need to be made, the problem s sgnfcantly more dffcult. Interestngly, ths problem s mathematcally smlar to the load sheddng problem n whch samplng s used to reduce the update tme of sketchng. A thrd problem s sketchng tuples that are processed by an onlne aggregaton engne n order to compute statstcs useful for decson makng 8, 9. In ths paper we analyze the sketch over samples estmator for generc samplng. Then we nstantate the results for three dfferent types of samplng. Our techncal contrbutons are: We provde a generc analyss of the sketch over samples estmator. The analyss conssts n expressng the frst two frequency moments of the estmator n terms of the moments of the samplng frequency random varables. We nstantate the results for sketchng Bernoull samples. Ths mmedately ndcates how random load sheddng for sketchng data streams behaves. We nstantate the generc analyss for sketchng samples wth replacement from a large populaton. The analyss generalzes to sketchng..d. samples from an unknown dstrbuton. The ablty to sketch..d. data s mportant f sketches are to be used for data-mnng applcatons. We nstantate the generc analyss for sketchng samples wthout replacement. Such samples are processed by onlne aggregaton engnes. By sketchng the samples,
mportant statstcs can be derved wth lttle computatonal overhead. We present emprcal evdence that the analyss s necessary snce the error of the sketch over samples estmator s not smply the sum of the errors of the two ndvdual estmators. The nteracton, whch s predcted by the analyss, plays a major role. The experments also pont out that n the majorty of the cases a 0% sample results n mnmal error degradaton the sketchng of streams can thus be sped-up by a factor of 0. In the rest of the paper, we ntroduce the formal problem n Secton II. Secton III gves an overvew on samplng whle Secton IV ntroduces sketches. The formal analyss of the combned sketch over samples estmator s detaled n Secton V. We dscuss possble applcatons of sketchng sampled data n Secton VI. The emprcal evaluaton of the combned estmator s presented n Secton VII, whle Secton VIII concludes the paper. Related Work There exsts a large body of work on approxmate query processng methods. The dea of combnng two estmators to captalze on the strengths of both s not new. F-AGMS sketches 3 are essentally a combnaton of random hstograms and AGMS sketches. 0 presents a method to buld ncremental hstograms from samples. To the best of our knowledge, sketchng and samplng have not been combned n a prncpled fashon before. The man dffculty n characterzng sketches over samples s the fact that the samplng analyss 8, s performed n the tuple doman whle the sketch analyss s performed n the frequency doman. Ths s the frst obstacle we overcome n ths paper. The work on sketchng probablstc data streams, 3 s somehow smlar to our work. The mportant dfference s the fact that samplng s part of the estmate n our work whle t represents only a way to nterpret the probablstc data n the related work. The results n do not characterze the sketch over sample estmator but approxmate the probablstc aggregates usng sketches. The only overlap n terms of analyss seems to be the computaton of the expected value of sketch over samples for the second frequency moment computaton n 3. 4 presents an alternatve method to mprove the sketchng rate of a data stream by determnstcally skppng stream tems. In our work the tems that are sketched are randomly selected through a samplng process. II. PRELIMINARIES The general problem we dscuss throughout the paper s how to approxmate the sze of jon of two relatons and the self-jon sze or second frequency moment of a relaton. Let F and G be two streamng relatons, each havng a sngle attrbute A, wth doman I. Furthermore, let f and g be the frequency of the value n F and G, respectvely. Wth ths, the sze of jon of relatons F and G can be wrtten as the dot-product of ther frequency vectors: F A G = f g () When relatons F and G are dentcal, the quantty f s known as the self-jon sze of F. In order to compute the sze of jon exactly, the frequency vectors of the two relatons have to be stored n full. Ths requres space proportonal to the doman of the jonng attrbute A, whch s nfeasble for large domans, e.g., I = 64. Thus, randomzed solutons wth reduced space requrements and provable error guarantees have been proposed. The standard technques 7, 5 to derve error guarantees or confdence ntervals for an estmate s to compute the expected value and the varance and then to use ether dstrbuton-ndependent bounds gven by Chernoff s and Chebyshev s nequaltes, or to use dstrbuton-dependent bounds. In the latter case, usually the Central Lmt Theorem or one of ts generalzatons s used to argue that the dstrbuton of the estmate has a partcular shape, and then error bounds based on the assumed dstrbuton wth the same expected value and varance are computed. In order to smplfy the exposton, we provde results n the form of expected values and varances throughout the paper. Actual error guarantees can be obtaned straghtforwardly usng the mentoned technques. III. SAMPLING Samplng as an approxmaton technque conssts n obtanng samples F and G from relatons F and G, respectvely, computng the sze of jon aggregate over the samples, and applyng a correcton to ensure that the samplng estmator s unbased. Ths method s generc and apples to all types of samplng. To smplfy the theoretcal exposton, we keep the treatment of samplng as generc as possble. In Secton II we expressed the sze of jon aggregate as a functon of f and g, the frequences of value of the jon attrbute n relatons F and G, respectvely. If we defne f and g to be the frequences of n F and G, respectvely, the sze of jon of the sample relatons s: F A G = f g () f and g are random varables that depend on the type of samplng and the parameters of the samplng process. Interestngly, a large part of the characterzaton of samplng can be carred out wthout specfyng the type of samplng. Ths s also true for sketches over samples n Secton V. A. Generc Samplng In general, F A G s not an unbased estmator for the sze of jon F A G. Fortunately, n the majorty of the cases, a constant correcton that scales for the dfference n sze between the samples and the orgnal relatons can be made to obtan an unbased estmator. If we defne the estmator as X = C f g, where C s the scalng factor, we can determne the value of C such that X s unbased. In
order to derve error bounds for the estmator, the expectaton E X and the varance X have to be computed. It turns out that expressons for E X and X can be wrtten for generc samplng n terms of the moments of the frequency random varables f and g. There are two dstnct cases that need separate treatment. The frst case s when relatons F and G are dfferent and the samples are obtaned ndependently from the two relatons. The second case s when F and G are dentcal, thus only one sample s avalable. Ths stuaton arses n the case of self-jon sze. When F and G are obtaned ndependently, the random varables f and g are also ndependent. Proposton (Sze of Jon): Let X = C f g be the estmator for the sze of jon defned over the generc samples F and G. Then: E X = C E f E g X = C E ( ) f f j E g g j E f E g j I When F and G are dentcal and only the sample F s avalable, the random varables f and g are also dentcal. Proposton (Self-Jon Sze): Let X = C f be the estmator for the self-jon sze defned over the generc sample F. Then the expectaton and the varance of X are gven by: E X = C E f X = C E ( f f j E f ) (4) j I A specfc type of samplng determnes the values of the expectatons that appear n the formulas. These expectatons are moments of the frequency random varables. They can be derved from the moment generatng functon correspondng to the samplng process. For the types of samplng we consder n ths paper the moment generatng functon s well known (see for example 6). Dervng fnal formulas for E X and X and determnng the constant C mght seem just a matter of pluggng n these quanttes for a specfc type of samplng, but the actual process s ntrcate because the frequency moments have to be separately computed before the algebrac manpulatons are carred out. The advantage of analyzng samplng n the frequency doman, as we do n ths secton, s that t allows the analyss to be extended to sketches over samples. Samplng estmators lke the ones consdered here have been analyzed n the publshed lterature (the theory n 9 provdes the analyss for all types of smple unform samplng and for an arbtrary number of relatons), but the analyss s n the tuple doman not the frequency doman. Interestngly, the analyss n the frequency doman s smpler than the analyss n the tuple (3) doman snce, as we show n ths paper, the frequency random varables have easly dentfable dstrbutons for whch the moment generatng functons are avalable. Dervng formulas for expected value and varance becomes just a matter of carryng the necessary algebrac manpulatons. We consder three types of samplng: Bernoull samplng, samplng wth replacement, and samplng wthout replacement. We derve the formulas for expectaton and varance as a functon of the samplng frequences both for samplng and for sketchng over samples (Secton V). Ths parallel treatment smplfes the nterpretaton of the complex results obtaned for sketches over samples. B. Bernoull Samplng When the samplng process s Bernoull, each tuple n F and G s selected ndependently n the sample F and G wth probablty p or q, 0 p, 0 q, respectvely. Then, f and g are ndependent bnomal random varables 6, f = Bnomal(f, p) and g = Bnomal(g, q), respectvely, wth expected values: E f = pf, E g = qg (5) The scalng factor for the sze of jon estmator s n ths case C = pq. The expectaton and the varance for Bernoull samplng can be derved n a straghtforward manner usng the frequency moments of the bnomal random varables correspondng to the samplng frequences. Proposton 3 (Sze of Jon): Let X = pq f g be the estmator for the sze of jon defned over the Bernoull samples F and G. Then the expectaton and the varance of X are gven by: E X = f g X = p f g + q p q f g + ( p)( q) pq f g The stuaton s more complcated for self-jon sze because the generc estmator X = C f has a bas that cannot be corrected by smply multplyng wth a scalng factor. Nevertheless, the generc formula for the varance n Equaton (4) s stll applcable. Proposton 4 (Self-Jon Sze): Let X = p f p p f be the estmator for self-jon sze defned over the Bernoull sample F. Then: E X = f X = p p 3 4p f 3 + p( 3p) f p( 3p) f (7) (6)
C. Notaton for Samplng Coeffcents In order to wrte compact formulas for samplng wth and wthout replacement, we use the followng notaton throughout the rest of the paper: α = F F, α = F F, α = F F β = G G, β = G G, β = G G α and β are the samplng fractons from relatons F and G, respectvely. α, α, β, β are just small varatons that appear n formulas. D. Samplng wth replacement A sample of fxed sze can be generated by repeatedly choosng a random tuple from the base relaton for the specfed number of tmes. If the same tuple can appear n the sample multple tmes, the process s samplng wth replacement. In ths case the random varables correspondng to the frequences n the sample, f and g, respectvely, are the components of a multnomal random varable 6 wth parameters the sze of the sample and the probablty f g G (8) F and, respectvely, where F and G are the sze of F and G. Snce each component of a multnomal random varable s a bnomal random varable, the expectatons n Equaton 5 stll hold but wth dfferent probabltes: E f = αf, E g = βg (9) The exact formulas for expectaton and varance can be derved as for Bernoull samplng. The moments of a multnomal random varable have to be used nstead. Proposton 5 (Sze of Jon): Let X = αβ f g be the estmator for the sze of jon defned over the samples wth replacement F and G. Then the expectaton and the varance of X are gven by: E X = f g X = f g + F αβ f g αβ + G α β ( ) f g + (α β αβ) f g (0) An unbased estmator for self-jon sze defned over the sample wth replacement F s X = αα f α F. Notce that the estmator depends only on the sze of the base relaton and the sze of the sample. X can be derved from the formula of the varance for generc samplng n Equaton (4). We omt the actual formula here due to lack of space. E. Samplng wthout replacement A sample wthout replacement from a relaton conssts of a random subset of tuples selected from the relaton. The dfference between samplng wth replacement and samplng wthout replacement s that a tuple can appear at most once n a sample wthout replacement whle t can appear multple tmes n a sample wth replacement. In ths case the random varables correspondng to the frequences n the sample, f and g, respectvely, are the components of a multvarate hypergeometrc random varable 6. In order to derve the exact formulas for expectaton and varance, the actual moments of the multvarate hypergeometrc dstrbuton have to be plugged n. Proposton 6 (Sze of Jon): Let X = αβ f g be the estmator for the sze of jon defned over the samples wthout replacement F and G. Then the expectaton and the varance of X are gven by: E X = f g X = ( α )( β ) f g + ( α )β f g αβ +α ( β ) ( ) f g + (α β αβ) f g () The only dfference between samplng wth replacement and samplng wthout replacement s the coeffcents of the terms appearng n the varance formula. Whle the varance of samplng wthout replacement becomes 0 when the entre relaton s sampled, the varance of samplng wth replacement never becomes 0. An unbased estmator for self-jon sze defned over the sample wthout replacement F s X = α α αα f F. The varance of X can be derved from the formula of the varance for generc samplng n Equaton (4). We do not nclude the actual formula here. IV. SKETCHES Whle samplng technques select a random subset of tuples from the nput relaton, sketchng technques summarze all the tuples as a small number of random varables. Ths s accomplshed by projectng the doman of the nput relaton on a sgnfcantly smaller doman usng random functons. Multple sketchng technques are proposed n the lterature for estmatng the sze of jon and the second frequency moment (see 4 for detals). Although usng dfferent random functons,.e., {+, } or hashng, the exstng sketchng technques have smlar analytcal propertes,.e., the sketch estmators have the same varance. For ths reason we focus on the basc AGMS sketches, throughout the paper. The basc AGMS sketch of relaton F conssts of a sngle random varable S that summarzes all the tuples t from F. S s defned as: S = ξ t.a = f ξ () t F where ξ s a famly of {+, } random varables that are 4 wse ndependent. Essentally, a random value of ether + or s assocated to each pont n the doman of attrbute A.
Then, the correspondng random value s added to the sketch S for each tuple t n the relaton. We can defne a sketch T for relaton G n a smlar way and usng the same famly ξ. Proposton 7 (Sze of Jon): The sketch-based estmator X defned as: X = S T = f ξ g j ξ j (3) j I s an unbased estmator for the sze of jon F A G. The varance of the sketch estmator s gven by: X = ( ) f gj + f g f g (4) j I Proposton 8 (Self-Jon Sze): The unbased estmator for the self-jon sze s defned as: X = S = f f j ξ ξ j (5) j I The varance of the sketch estmator s gven by: ( ) X = (6) A common technque to reduce the varance of an estmator s to generate multple ndependent nstances of the basc estmator and then to buld a more complex estmator as the average of the basc estmators. Whle the expected value of the complex estmator s equal wth the expectaton of one basc estmator, the varance s reduced by a factor of = n snce n n k= X k = X k n, where n s the number of basc estmators beng averaged. Ths technque can be appled to reduce the varance of the sketch estmator f dfferent famles ξ are used for the basc estmators (see, for detals). n n k= X k f f 4 V. SKETCHES OVER SAMPLES Gven the ablty of samplng to make predctons about an entre dataset from a randomly selected subset and that sketches requre the entre dataset n order to determne any of ts propertes, an nterestng queston that mmedately arses s how to combne these two randomzed technques. Although the ntutve answer to ths queston seems to be smple the sketch s computed over a sample of the data nstead of the entre dataset the behavor of the combned estmator s not the smple composton of the ndvdual behavor of the ngredents. A careful analytcal characterzaton of the estmator needs to be carred out. Furthermore, the samplng process can be ether explct and executed as an ndvdual step before sketchng s done, or mplct, stuaton n whch the nput dataset s assumed to be a sample from a large populaton. In the frst case, a sgnfcant speedup n updatng the sketch structure can be obtaned snce only a random subset of the data s actually sketched. Ths process s essentally a load sheddng technque for sketchng extremely fast data streams that cannot be otherwse sketched. It can be mplemented as an explct Bernoull samplng that randomly flters the tuples that update the sketch structure. In the second case, the data s assumed to be a sample from a large populaton and the goal s to determne propertes of the populaton based on the sample. The sample tself s assumed to be large enough so t cannot be stored explctly, thus sketchng s requred. If the populaton s nfnte, the entre process can be seen as sketchng..d. samples from an unknown dstrbuton. We provde the analyss both for samplng wth replacement and samplng wthout replacement. The analyss straghtforwardly extends to..d. samples f all estmators are normalzed by the sze of the populaton and the lmt, when the populaton sze goes to nfnty, s taken. In such a crcumstance, the frequences n the orgnal unknown populaton become denstes of the unknown populaton, but everythng else remans the same. In ths secton, we provde a generc framework for sketchng sampled data streams n order to estmate the sze of jon and the self-jon sze. Then we compute the frst two frequency moments of the combned estmator for the most common types of samplng Bernoull samplng, samplng wth replacement, and samplng wthout replacement. Ths provdes suffcent nformaton to allow the dervaton of confdence bounds for the combned estmator. A. Sketches over Generc Samplng Consder F to be a generc sample obtaned from relaton F. Sketchng the sample F s smlar to sketchng the entre relaton F and conssts n summarzng the sampled tuples t as follows: S = ξ t.a = f ξ (7) t F where ξ s a famly of {+, } random varables that are 4 wse ndependent. A sample G from relaton G can be sketched n a smlar way usng the same famly ξ: T = ξ t.a = g ξ t G Sze of Jon We defne the estmator X for the sze of jon F A G based on the sketches computed over the samples as follows: X = C ST = C f ξ g jξ j (8) Notce that the estmator s smlar to the sketch estmator computed over the entre dataset n Proposton 7 multpled wth a constant scalng factor C that compensates for the dfference n sze. Self-Jon Sze The self-jon sze or second frequency moment of a relaton s the partcular case of sze of jon between two nstances of the same relaton. One way of analyzng the sketches over samples estmator for the self-jon sze problem s to buld two ndependent samples and two ndependent sketches from the same base relaton and then to apply the results correspondng to sze of jon. Although sound from an analytcal pont of j I
vew, ths soluton s neffcent n practce. In the followng we consder a practcal soluton that requres the constructon of only one sample and one sketch from the base relaton. A new estmator for the self-jon sze has to be defned nstead, but the analyss s closely related to the analyss of the sze of jon estmator. Wth S defned n Equaton 7, we defne the self-jon sze estmator X as follows: ( ) X = S = C f ξ = C f ξ f jξ j (9) where C s the same scalng factor compensatng for the dfference n sze. Notce that the dfference between the sze of jon estmator and the self-jon sze estmator s only at the samplng level snce the same famly of ξ random varables s used for sketchng n both cases. For ths reason we carry out the analyss for the two estmators n parallel and make the dstncton only when necessary. In order to derve confdence bounds for the estmator X, the frst two moments, expected value and varance, have to be computed. Intutvely, the scalng factor C should compensate for the dfference n sze and make the estmator unbased. Snce the two processes, samplng and sketchng, are ndependent and sequental, the nteracton between them s mnmal and the sum of the two varances should be a good estmator for the varance of the combned estmator. In the followng, we derve the exact formulas for the expectaton and the varance n the generc case. The ndependence of the famles of random varables correspondng to samplng and sketchng, f, g and ξ, respectvely, plays an mportant role n smplfyng the computaton. Ths ndependence s due to the ndependence of the two random processes. Whle the computaton of the expectaton s straghtforward, the computaton of the varance s more ntrcate snce the nteracton between sketchng and samplng s more complex and t can be characterzed only through a detaled analyss. The frst step n our analyss s to derve the formulas for the moments of the basc estmator. Proposton 9 and 0 characterze the behavor of the basc sketch over samples estmator. Proposton 9 (Sze of Jon): Let the sketch over samples estmator for the sze of jon to be defned as X = C f ξ j I g j ξ j. Then, the expectaton and the varance of X are gven by: E X = C E f E g X = C E f j I E g j + j I j I E ( f E g E f E g E f f j E g g j ) (0) Proposton 0 (Self-Jon Sze): Let the sketch over samples estmator for the self-jon sze to be defned as C f ξ j I f j ξ j. Then, the expectaton and the varance of X are gven by: E X = C E f X = C 3 E f f j j I ( E f ) E f 4 () Notce that the varance of sketchng over generc samplng s an expresson dependng only on the propertes of the samplng process. More precsely, n order to evaluate the varance, only expectatons of the form E f and E f f j have to be computed, where f and f j are random varables correspondng to the frequences n the sample. The averagng technque appled to reduce the varance of basc sketches n Secton IV cannot be used straghtforwardly n the case of sketches computed over samples. Ths s the case snce, although the basc sketch estmators are bult ndependently usng dfferent ξ famles of random varables, they are computed over the same sample and ths ntroduces correlatons between any two estmators. The varance of the average estmator s n ths case: n n X k = n X k + (n ) Cov k l X k, X l k= () where n s the number of basc estmators beng averaged and Cov X k, X l = E X k X l E X k E X l s the covarance between any two basc estmators. The next step n our analyss s to derve the formulas for the varance of the average sketch over samples estmator. Proposton and contan the derved formulas. Proposton (Sze of Jon): The varance of the average sketch over samples sze of jon estmator s gven by: n n X k = k= C E ( ) f f j E g g j E f E g j I + n E f j I E g j + j I E f f j E g g j E f ) E g (3) Essentally, the varance of the average estmator s the sum of the varance of the generc samplng estmator n Equaton 3, the varance of the sketch estmator n Equaton 4, and a
term correspondng to the nteracton between the two random processes. For the partcular types of samplng consdered n ths work, we derve the exact formula of the nteracton term. Notce that the mprovement obtaned by averagng s less sgnfcant than a factor of n obtaned n the case of ndependent estmators. Proposton (Self-Jon Sze): The varance of the average sketch over samples self-jon sze estmator s gven by: n n X k = C k= j I ( E f ) + n j I E f f j E f f j E f 4 (4) The ndependence of sketchng and samplng plays an mportant role n dervng the formulas for expectaton and varance. The ndependence has the effect of factorzng the expectatons over products of random varables correspondng to sketchng and samplng. Thus, only expectatons nvolvng the samplng frequency random varables appear n the fnal formulas snce the expectatons correspondng to sketches are ether 0 or,.e., E ξ ξ j = E ξ E ξ j = 0 whenever j due to the 4 wse ndependence of the famly ξ, and E ξ =. Usng these equaltes and some complex algebrac manpulatons, the gven formulas are obtaned. The domnant factor that smplfes the analyss s the modelng of samplng as frequency random varables. Ths s our man contrbuton. B. Bernoull Samplng We nstantate the formulas derved for generc samplng n Secton V-A wth the moments of the bnomal random varables correspondng to the Bernoull samplng frequences. Proposton 3 (Sze of Jon): Let the unbased sketch over Bernoull samples sze of jon estmator to be defned as X = pq f ξ j I g j ξ j. Then, the varance of the average estmator s gven by: n n X k = k= p f g + q f ( p)( q) g + f g p q pq + ( ) f gj + f g f g n j I + p f gj + q f g j n p q j I,j j I,j ( p)( q) + f g j pq j I,j (5) Proposton 4 (Self-Jon Sze): The varance of the average unbased self-jon sze estmator X = p ( f ξ ) s gven by: p p f n p p 3 n k= X k p p f = 4p f 3 + p( 3p) ( ) + f f 4 n + n ( p) p j I,j f f j + f p( 3p) ( p) p f j I,j f f j (6) The varance of the average estmator s, as derved for generc samplng, the sum of the average sketch estmator ndvdual varance, the Bernoull samplng estmator ndvdual varance, and an nteracton term. Snce dervng an exact analytcal relaton between these terms s a dauntng task, we consder some extreme scenaros that allow a partal characterzaton. Frst notce that n, the number of estmators beng averaged, can be gnored when comparng the sketch varance and the nteracton varance because t has the same effect on both (the varance s reduced by a factor of n). When the dstrbuton of the frequences s unform, the nteracton varance s the domnant term whenever the unque frequency has a smaller value than the sze of the doman I. At the other extreme, when the dstrbuton of the frequences s skewed, the sketch varance s the domnant term by far. These results suggest that the nteracton varance could represent a problem for unform-lke data. Ths s not necessarly the case because the value of the varance for unform dstrbutons s sgnfcantly smaller than for skewed data, thus, although the nteracton varance s the domnant term, ts absolute value s not large. To better understand the exact sgnfcance of each of the terms appearng n the varance and to confrm the analyss for the extreme cases, we desgned a set of smulatons to determne the relatve contrbuton of each of the terms. The expermental setup s descrbed n Secton VII. Fgure and depct the relatve contrbuton of each of the three terms appearng n the varance of the average estmator over Bernoull samples (Equaton 6 and 5). The relatve contrbuton s represented as a functon of the data skew for dfferent samplng probabltes. A common trend both for sze of jon and self-jon sze s that the nteracton term s hghly sgnfcant for low skew data. Ths completely justfes the analyss we develop throughout the paper snce an analyss assumng that the varance of the composed estmator s the sum of the varances of the basc estmators would be ncorrect. At the same tme, ths suggests that the accuracy of the sketch over samples estmator can be sgnfcantly
worse than the accuracy of the sketch estmator for nonskewed data. As already explaned, ths s not necessarly true. Moreover, the expermental results we provde n Secton VII show that ths s not the case for practcal scenaros. As expected, the mpact of the varance of the samplng estmator s more sgnfcant as the sze of the sample s smaller. For self-jon sze (Fgure ), the varance s domnated by the term correspondng to the samplng estmator, whle for sze of jon (Fgure ) the varance of the sketch estmator quantfes for almost the entre varance rrespectve of the samplng probablty. Ths s entrely supported by the exstng theoretcal results whch show that sketches are optmal for estmatng the second frequency moment whle samplng s optmal for the estmaton of sze of jon. C. Samplng wth replacement In a smlar way to Bernoull samplng, we nstantate the formula for the sze of jon estmator derved for generc samplng wth the moments of the multnomal random varables correspondng to the samplng frequences. The nteracton between two dfferent samplng frequences makes the dervaton of the formula more complcated. We do not provde the formula for self-jon sze varance due to space constrants. Proposton 5 (Sze of Jon): Let the unbased sketch over samples wth replacement sze of jon estmator to be defned as X = αβ f ξ j I g j ξ j. Then, the varance of the average estmator s gven by: n n X k k= = αβ f g + F αβ f g + G α β ( ) f g + (α β αβ) f g + α β ( ) f gj + f g f g n α β j I + f g j + F αβ f gj n αβ j I,j j I,j + G α β f g j j I,j (7) D. Samplng wthout replacement We apply the same procedure for samplng wthout replacement. We obtan smlar results to the other types of samplng the terms n the varance are the same, only the coeffcents are dependent on the samplng procedure. The formula for self-jon sze s not provded due to space lmtatons. Proposton 6 (Sze of Jon): Let the unbased sketch over samples wthout replacement sze of jon estmator to be defned as X = αβ f ξ j I g j ξ j. Then, the varance of the average estmator s gven by: n n X k = k= ( α )( β ) f g + ( α )β f g αβ +α ( β ) ( ) f g + (α β αβ) f g + α β ( ) f gj + f g f g n α β j I + ( α )( β ) f g j n αβ + ( α ) β E. Dscusson j I,j j I,j f g j + α ( β ) j I,j f g j (8) The result we derved n ths secton s somewhat surprsng: the varance of the combned sketch-samplng estmator can be wrtten as the sum of the varance of the sketch estmator, the varance of the samplng estmator, and an nteracton term. Ths separaton of the varance formula was accomplshed for all three types of samplng and both sze of jon and self-jon-sze problems. Although the sketch varance seems to be the domnant term, the exact sgnfcance of each of the terms s dependent on the actual dstrbuton of the data (see the expermental results for Bernoull samplng). When multple sketch estmators are averaged n order to decrease the varance, the covarance must also be consdered snce the sketch estmators are computed over the same sample, thus they are correlated. Our results show that the varance of the combned estmator does not decrease by a factor equal to the number of averages, only the sketch varance and the nteracton term do. VI. APPLICATIONS In ths secton, we dentfy applcatons for sketchng sampled data for each of the types of samplng dscussed n the paper. We also delve nto the algorthmc ssues correspondng to the combned randomzed process. A. Bernoull Samplng Even though sketchng can be mplemented for farly hghspeed data streams, the update tme could stll become the lmtng factor f all the tuples need to be sketched. In order to allevate ths problem, some of the tuples need to be dropped. Bernoull samplng provdes a prncpled approach to drop tuples and stll be able to characterze the result. Sketchng Bernoull samples s a method to further ncrease the rate of the data streams that can be sketched. Used along hashng, sketchng over Bernoull samples allows the processng of
Interacton Samplng Sketch Interacton Samplng Sketch ance terms dstrbuton 0.8 0.6 0.4 0. ance terms dstrbuton 0.8 0.6 0.4 0. 0 0 0.0 0.5.0.5.0.5 3.0 4.0 5.0 Zpf coeffcent / p 0.0 0.5.0.5.0.5 3.0 4.0 5.0 Zpf coeffcent / p Fg.. Sze of jon varance Fg.. Self-jon sze varance data streams havng arrval rates of bllons of tuples per second encountered n the exstng networkng equpment. The analyss n Secton V-B provdes a complete characterzaton of the error behavor for the overall process. The expermental results n Secton VII-A show that for a samplng fracton of % the decrease n accuracy over sketchng the entre data stream s nsgnfcant. The basc Bernoull samplng algorthm conssts n generatng a random number for each tuple n the dataset. The tuples for whch the random value s smaller than the samplng probablty are ncluded n the sample. The man drawback of Bernoull samplng s that the sze of the sample s unknown pror to runnng the process. Ths s not a problem anymore when the sample s sketched nstead of beng explctly stored. The algorthm for sketchng Bernoull samples s a smple extenson of the basc Bernoull samplng algorthm. The tuples that are selected n the sample are nserted nto the sketch data structure nstead of beng explctly stored. Snce updatng the sketch s an extremely fast operaton 7, t could be doubtful that the extra samplng s ndeed benefcal for speedng the entre process. Fortunately, there exst algorthms that avod the con tossng for each tuple by generatng the ntervals between the tuples that are selected n the sample 8. Ths way there s work to be done only for the tuples that are actually sampled and subsequently sketched. In ths stuaton the speedup over sketchng the entre data stream s clearly proportonal wth the samplng fracton. B. Samplng wth replacement Consder a generatve model that draws samples from a fnte populaton. The samples are draw wth replacement. A data stream contanng all the samples s generated. The objectve s to determne propertes, e.g., the second frequency moment, of the generatve model, or correlatons between two dfferent generatve models, e.g., the sze of jon, from the stream of samples. An addtonal requrement s that the stream s large enough that t cannot be stored n full. These knd of scenaros are frequent n data-mnng applcatons 6. Sketchng the stream of samples obtaned wth replacement from the fnte populaton whose sze s known represents a soluton to determnng propertes of the generatve model. The nput stream s the actual sample n ths case, so no explct samplng of the stream s requred. Thus, the standard updatng algorthm for sketches can be used n ths case. The estmaton algorthm s though dfferent because t has to take nto consderaton that the stream s only a sample. The formulas derved n Secton V-C have to be used for estmaton n ths case. The mportant queston n ths case s how accurate s the combned estmator. And how large has to be the sketched sample n order to obtan accurate estmatons. The expermental results n Secton VII-B show that for a samplng fracton of 0% the estmaton s accurate and stable. No sgnfcant ncrease n accuracy s obtaned f the sample sze s larger. C. Samplng wthout replacement The applcaton we have n mnd for sketchng samples obtaned wthout replacement s onlne aggregaton. In a tradtonal database engne, the exact answer to a query s provded only after the entre data s processed. The user does not get any clues about the result durng the query executon. Ths may take a long tme for complex queres over a datawarehouse. Onlne aggregaton has a dfferent strategy. Partal approxmate answers are provded to the user whle the query s processed by executng equvalent queres on a smaller fracton of the entre data and then scalng the result. As more data s processed, the accuracy of the approxmate result ncreases to the pont where the exact answer s returned (when the entre data s processed). The fundamental requrement for the partal results to be estmates of the fnal result s that the portons of the data the equvalent queres are executed on to represent random samples wthout replacement from the entre data. More detals on onlne aggregaton can be found n the lterature8,, 9. Sketchng samples obtaned wthout replacement represents a fast and nexpensve method to gather some of the statstcs
(second frequency moment, correlaton between attrbutes) used by an onlne aggregaton engne to take decsons and to compute the approxmate results. The dea s to buld sketches for the desred statstcs whle the relatons (materalzed or ntermedary results) are scanned. The fracton of the relaton seen at each pont durng the scan represents a sample wthout replacement of the entre relaton as long as the order of the tuples s random. More accurate estmates for the computed statstcs are avalable as the scannng advances. The goal s to obtan stable estmators as early as possble such that the onlne aggregaton engne takes the optmal decsons and provdes estmates as accurate as possble. The analyss n Secton V-D provdes a complete characterzaton of the combned estmators. In Secton VII-C we show expermental results for whch accurate estmates are avalable after only 0% of the relatons s scanned. From an algorthmc pont of vew, sketchng a relaton whle t s scanned ncurs almost no addtonal cost. The advantage over usng the samples to provde estmates s that the samples do not have to be explctly stored and processed. There s extra memory requred only to store the sketches. Sketchng can be done for arrval rates of tens of mllons of tuples per second wthout any tme penalty. Or t can be executed as a separate thread n parallel wth scannng, whch s necessary. On the modern mult-core processors, sketchng can be done essentally for free. VII. EXPERIMENTAL EVALUATION We pursue two man goals n the expermental evaluaton of the sketchng over samples estmators. Frst, we want to determne the behavor of the error of the sketch over samples estmator when compared wth the error of the sketch estmator. And second, we want to dentfy what s the behavor of the estmaton error as a functon of the sample sze. We desgn experments to determne these relatons for all three types of samplng presented n the paper. And both for the sze of jon and the self-jon sze problems. In order to accomplsh these goals, we desgned a seres of experments over both synthetc datasets and the TPC-H dataset. Synthetc datasets allow a better control of the mportant parameters that affect the results, whle the TPC-H dataset valdates the results for large scales. The synthetc datasets used n our experments contan ether 0 or 00 mllon tuples generated from a Zpfan dstrbuton wth the coeffcent rangng between 0 (unform) and 5 (skewed). The doman of the possble values s mllon. In the case of sze of jon, the tuples n the two relatons are generated completely ndependent. For the experments over the TPC-H dataset, we used the scale benchmark data. We used F-AGMS sketches 3 n all of the experments due to ther superor performance both n accuracy and update tme (see 4 for detals on sketchng technques). The number of buckets s ether 5, 000 or 0, 000. Ths s equvalent to averagng 5, 000 or 0, 000 basc estmators. In order to be statstcally sgnfcant, all the results presented n ths secton are the average of at least 00 ndependent experments. A. Bernoull Samplng estmaton true result The expermental relatve error,.e., true result, of the sketch over Bernoull samples estmator s depcted n Fgure 3 and 4 as a functon of the data skew for dfferent samplng probabltes. Probablty p =.0 corresponds to sketchng the entre dataset. These expermental results show that, wth some exceptons, the samplng rate does not sgnfcantly affect the accuracy of the sketch estmator. For Zpf coeffcents smaller than, n the case of self-jon sze, and smaller than 3, n the case of sze of jon, the error of the sketch estmator s almost the same both when the entre dataset s sketched or when only one tuple out of a thousand s sketched. The mpact of the samplng rate s sgnfcant only for hgh skew data n the case of self-jon sze. Ths s to be expected from the theoretcal analyss (Fgure ). What cannot be explaned from the theoretcal analyss s the effect of the samplng rate for skewed data n the case of sze of jon. As shown n 4, the expermental behavor of F-AGMS sketches s n some cases orders of magntude better than the theoretcal predctons, thus although the theoretcal varance s domnated by the varance of the sketch estmator, the emprcal absolute value s small when compared to the varance of the samplng estmator. In the lght of 4, the emprcal results for hgh samplng rates are much better than the theoretcal predctons, ncreasng thus the sgnfcance of the samplng rate for hghly skewed data. B. Samplng wth replacement In Fgure 5 and 6 we depct the expermental relatve error as a functon of the sample sze for samplng wth replacement. Snce the actual sze of the sample s dfferent for dfferent Zpf coeffcents, we represent on the x axs the sze of the sample as a fracton from the populaton sze, wth correspondng to a sample wth replacement of sze equal to the populaton sze. As expected, the error s decreasng as the sample sze becomes larger, but t stablzes after a certan sample sze (a fracton of the populaton sze for the ncluded fgures). Thus, sketchng more samples does not provde any ncrease n the accuracy after a certan pont. For the stuatons depcted n Fgure 5 and 6, the edge samplng fracton s around. C. Samplng wthout replacement We used the scale TPC-H data for our experments on sketchng samples wthout replacement. Fgure 7 depcts the error as a functon of the samplng rate for the sze of jon between the relatons lnetem and orders. In Fgure 8 we plot the error of the second frequency moment of relaton lnetem on the l orderkey attrbute. As expected, the error of the selfjon sze estmator decreases whle ncreasng the sample sze and t becomes stable for samplng rates larger than 0%. For sze of jon, the behavor of the error s somehow unexpected. The smallest error n Fgure 7 s obtaned for a samplng rate of 0%. Then, the error starts to ncrease whle ncreasng the samplng rate. Ths behavor s due to F-AGMS sketches.
0 p= p= p= p=.0 Relatve error (log scale) e-04 Relatve error (log scale) e-04 e-05 e-06 e-07 e-08 e-05 e-06 0 3 4 5 ZIPF coeffcent e-09 p= e-0 p= p= p=.0 e- 0 3 4 5 ZIPF coeffcent Fg. 3. Sze of jon error (Bernoull) Fg. 4. Self-jon sze error (Bernoull) Relatve errorv (log scale) ZIPF=0.0 ZIPF=0.5 ZIPF=.0 ZIPF=.0 Relatve errorv (log scale) ZIPF=0.0 ZIPF=0.5 ZIPF=.0 ZIPF=.0 Samplng rate (log scale) e-04 Samplng rate (log scale) Fg. 5. Sze of jon sample sze (WR) Fg. 6. Self-jon sze sample sze (WR) 0 00 Relatve error (log scale) Relatve error (log scale) 0 Samplng rate (log scale) Samplng rate (log scale) Fg. 7. Sze of jon sample sze (WOR) Fg. 8. Self-jon sze sample sze (WOR)
D. Extreme behavor of F-AGMS sketches An nterestng trend n Fgure 7 s the ncrease of the error as the samplng rate ncreases over 0%. Ths s counter ntutve snce we would expect smaller error as more data s sketched. Ths behavor s not predcted by the theory and t must be due to the fact that the theoretcal formulas are derved for AGMS sketches, but F-AGMS sketches are used n the experments. To understand why ths s happenng we have to reffer to the analyss of F-AGMS sketches n 4. F-AGMS sketches use a combnaton of hashes and AGMS sketches wthn each hash bucket. As more data s sketched, the contenton n buckets ncreases and ths produces a wder varance of the estmates. Ths suggests that n some stuatons t s better to sketch only a sample of the data rather than the entre data for F-AGMS sketches. E. Dscusson We provde expermental results for each of the types of samplng presented n the paper. The goal s to compare the combned sketch over samples estmator wth the sketch estmator and to determne ther relaton. Our experments for Bernoull samplng show that a sgnfcant speed-up (a factor of 0 n general and a factor of up to 000 n some cases) can be obtaned by sketchng only a small sample of the data nstead of the entre data. The decrease n accuracy due to samplng seems to be nsgnfcant. For samplng wth replacement, the dfference n accuracy between sketchng only a small sample (a fracton of or less from the populaton sze) and sketchng the entre data s mnmal. Moreover, the error becomes stable and t does not decrease anymore f the sample sze s ncreased above a certan threshold (0% n our results). Although the stuaton s almost smlar for samplng wthout replacement, we also observed some unexpected results for ths type of samplng. The smallest error s obtaned when only a sample of the data (0%) s sketched, not the entre dataset. Ths s due to the extreme behavor of F- AGMS sketches n some partcular stuatons 4. Essentally, sketchng more data can actually decrease the accuracy of F- AGMS sketch estmators f the sketched sample captures well enough the dstrbuton of the entre dataset. Summarzng, the expermental results show that the sketch over samples estmator has almost smlar accuracy to the sketch estmator startng wth samplng rates of 0% or smaller. VIII. CONCLUSIONS In ths paper we ntroduce the sketch over samples estmator for sze of jon and self-jon sze. We provde a detaled analyss of the error of ths estmator for generc samplng based on the moment generatng functon of the samplng frequences. Then, we nstantate the general results for three dfferent types of samplng: Bernoull samplng, samplng wth replacement, and samplng wthout replacement. Our results show that t s possble to express the varance of the combned estmator as the sum of the varance of the samplng estmator, the varance of the sketch estmator, and an nteracton term. Although the sketch varance seems to be the domnant term, the exact mportance of each of the terms s hghly dependant on the exact dstrbuton of the actual data. We provde expermental results that show that the accuracy of the combned estmator s almost smlar (and sometmes better) to the accuracy of the sketch estmator even for small samplng rates of 0%. We also dentfy possble applcatons of the sketch over samples estmator for each type of samplng. In concluson, we beleve that the sketch over samples estmator can be used nstead of the sketch estmator wthout a sgnfcant degradaton n accuracy and wth a clear gan n processng tme as long as the sample rate s around 0%. IX. ACKNOWLEDGMENTS Materal n ths paper s based upon work supported by the Natonal Scence Foundaton under Grant No. 044864. We would lke to thank the anonymous referees for ther useful revews that helped us to mprove the qualty of the paper. REFERENCES N. Alon, Y. Matas, and M. Szegedy, The space complexty of approxmatng the frequency moments, n STOC 996, pp. 0 9. N. Alon, P. Gbbons, Y. Matas, and M. Szegedy, Trackng jon and self-jon szes n lmted storage, J. Comput. Syst. Sc., vol. 64, no. 3, pp. 79 747, 00. 3 G. Cormode and M. Garofalaks, Sketchng streams through the net: dstrbuted approxmate query trackng, n VLDB 005, pp. 3 4. 4 F. Rusu and A. Dobra, Statstcal analyss of sketch estmators, n SIGMOD 007, pp. 87 98. 5 N. Tatbul, U. Çetntemel, S. Zdonk, M. Chernack, and M. Stonebraker, Load sheddng n a data stream manager, n VLDB 003, pp. 309 30. 6 P. Domngos and G. Hulten, Mnng hgh-speed data streams, n KDD 000, pp. 7 80. 7 J. Shao, Mathematcal Statstcs. Sprnger-Verlag, 999. 8 P. Haas and J. Hellersten, Rpple jons for onlne aggregaton, n SIGMOD 999, pp. 87 98. 9 C. Jermane, S. Arumugam, A. Pol, and A. Dobra, Scalable approxmate query processng wth the dbo engne, n SIGMOD 007, pp. 75 736. 0 P. Gbbons, Y. Matas, and V. Poosala, Fast ncremental mantenance of approxmate hstograms, n VLDB 997, pp. 466 475. C. Jermane, A. Dobra, S. Arumugam, S. Josh, and A. Pol, The sortmerge-shrnk jon, ACM TODS, vol. 3, no. 4, pp. 38 46. G. Cormode and M. Garofalaks, Sketchng probablstc data streams, n SIGMOD 007, pp. 8 9. 3 T. S. Jayram, A. McGregor, S. Muthukrshnan, and E. Vee, Estmatng statstcal aggregates on probablstc data streams, n PODS 007, pp. 43 5. 4 S. Bhattacharyya, A. Madera, S. Muthukrshnan, and T. Ye, How to scalably and accurately skp past streams, n ICDE Workshops 007, pp. 654 663. 5 R. Motwan and P. Raghavan, Randomzed Algorthms. Cambrdge Unversty Press, 995. 6 N. L. Johnson, S. Kotz, and N. Balakrshnan, Dscrete Multvarate Dstrbutons. John Wley & Sons, Inc., 996. 7 F. Rusu and A. Dobra, Pseudo-random number generaton for sketchbased estmatons, ACM TODS, vol. 3, no., p., 007. 8 F. Olken, Random Samplng from Databases Ph.D. Thess. UC Berkeley, 993.