Estimating Frequency Moments of Streams

In this class we will look at two simple sketches for estimating the frequency moments of a stream. The analysis will introduce two important tricks in probability: boosting the accuracy of a random variable by taking the median of means of multiple independent copies of the random variable, and using k-wise independent sets of random variables.

1 Frequency Moments

Consider a stream S = {a_1, a_2, ..., a_m} with elements drawn from a domain D = {v_1, v_2, ..., v_n}. Let m_i denote the frequency (also sometimes called the multiplicity) of value v_i in D, i.e., the number of times v_i appears in S. The k-th frequency moment of the stream is defined as:

    F_k = \sum_{i=1}^n m_i^k    (1)

We will develop algorithms that can approximate F_k by making one pass over the stream and using a small amount of memory, o(n + m).

Frequency moments have a number of applications. F_0 represents the number of distinct elements in the stream (which the FM-sketch from last class estimates using O(log n) space). F_1 is the number of elements in the stream, m. F_2 is used in database query optimization engines to estimate self-join size. Consider the query "return all pairs of individuals that are in the same location." Such a query has cardinality equal to \sum_i m_i^2 / 2, where m_i is the number of individuals at location i. Depending on the estimated size of the query, the database can decide (without actually evaluating the answer) which query answering strategy is best suited. F_2 is also used to measure the information in a stream. In general, F_k represents the degree of skew in the data. If F_k/F_0 is large, then some values in the domain repeat more frequently than the rest. Estimating the skew in the data also helps when deciding how to partition data in a distributed system.

2 AMS Sketch

Let us first assume that we know m. Construct a random variable X as follows:

- Choose a random element of the stream: pick i uniformly at random from {1, ..., m} and let x = a_i.
- Let r = |{a_j : j >= i, a_j = a_i}|, i.e., the number of times the value x appears in the rest of the stream (inclusive of a_i).
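For concreteness, here is a short illustrative computation of F_k done exactly, with one counter per distinct value. It uses O(n) space, which is exactly the cost the sketches below are designed to avoid; the function name is ours, not from the notes.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Compute F_k = sum_i m_i^k exactly (O(n) space baseline)."""
    counts = Counter(stream)               # m_i for each distinct value v_i
    return sum(m ** k for m in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]
print(frequency_moment(stream, 0))         # 3: number of distinct elements
print(frequency_moment(stream, 1))         # 6: length of the stream
print(frequency_moment(stream, 2))         # 14 = 3^2 + 2^2 + 1^2
```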
    X = m (r^k - (r-1)^k)

X can be constructed using O(log n + log m) space: log n bits to store the value x, and log m bits to maintain r.

Exercise: We assumed that we know the number of elements in the stream. However, the above can be modified to work even when m is unknown. (Hint: reservoir sampling.)

It is easy to see that X is an unbiased estimator of F_k.
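A single AMS estimator can be sketched as follows. This is a minimal illustration (class and method names are ours), and it uses reservoir sampling so that m need not be known in advance, per the exercise's hint.

```python
import random

class AMSEstimator:
    """One AMS estimator for F_k: stores only x, r, and the count m."""

    def __init__(self, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        self.m = 0        # number of elements seen so far
        self.x = None     # sampled value
        self.r = 0        # occurrences of x at or after the sampled position

    def update(self, value):
        self.m += 1
        # Reservoir sampling: replace the sample with probability 1/m,
        # so each position is sampled uniformly without knowing m upfront.
        if self.rng.random() < 1.0 / self.m:
            self.x, self.r = value, 1
        elif value == self.x:
            self.r += 1

    def estimate(self):
        # X = m * (r^k - (r-1)^k), an unbiased estimator of F_k.
        return self.m * (self.r ** self.k - (self.r - 1) ** self.k)
```

Note that for k = 1 the estimate is always exactly m = F_1; for larger k, averaging many independent copies is needed, which is where the median of means technique below comes in.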
    E(X) = (1/m) \sum_{i=1}^m E(X | i-th element in the stream was picked)
         = (1/m) \sum_{j=1}^n \sum_{c=1}^{m_j} E(X | a_i is the c-th repetition of v_j)
         = (1/m) \sum_{j=1}^n m [ 1^k + (2^k - 1^k) + ... + (m_j^k - (m_j-1)^k) ]
         = \sum_{j=1}^n m_j^k = F_k

We now show how to use multiple such random variables X to estimate F_k within relative error ε with high probability (1 - δ).

2.1 Median of Means

Suppose X is a random variable such that E(X) = µ and Var(X) < cµ^2, for some c > 0. Then we can construct an estimator Z such that for all ε > 0 and δ > 0,

    P(|Z - µ| > εµ) < δ    (2)

by averaging s_1 = Θ(c/ε^2) independent copies of X, and then taking the median of s_2 = Θ(log(1/δ)) such averages.

Means: Let X_1, ..., X_{s_1} be s_1 independent copies of X. Let Y = (1/s_1) \sum_i X_i. Clearly, E(Y) = E(X) = µ, and

    Var(Y) = Var(X)/s_1 < cµ^2/s_1

By Chebyshev's inequality,

    P(|Y - µ| > εµ) < Var(Y)/(ε^2 µ^2)

Therefore, if s_1 = 8c/ε^2, then P(|Y - µ| > εµ) < 1/8.

Median of means: Now let Z be the median of s_2 independent copies Y_1, ..., Y_{s_2} of Y. Let W_i be defined as follows:

    W_i = 1 if |Y_i - µ| > εµ, and 0 otherwise

From the previous result about Y, E(W_i) = ρ < 1/8. Therefore, E(\sum_i W_i) < s_2/8. Moreover,
whenever the median Z is outside the interval µ(1 ± ε), at least half of the Y_i's must be outside it, i.e., \sum_i W_i >= s_2/2. Therefore,

    P(|Z - µ| > εµ) <= P(\sum_i W_i >= s_2/2)
                    <= P(\sum_i W_i - E(\sum_i W_i) >= s_2/2 - s_2 ρ)
                     = P(\sum_i W_i - E(\sum_i W_i) >= ((1 - 2ρ)/(2ρ)) s_2 ρ)
                    <= 2 exp( -(s_2 ρ / 3) ((1 - 2ρ)/(2ρ))^2 )    (by Chernoff bounds)
                     < 2 exp( -s_2/3 )

where the last step holds because, when ρ < 1/8, ρ ((1 - 2ρ)/(2ρ))^2 > 1. Therefore, taking the median of s_2 = 3 log(2/δ) averages ensures that P(|Z - µ| > εµ) < δ.

2.2 Back to AMS

We use the median of means approach to boost the accuracy of the AMS random variable X. For that, we need to bound the variance of X by c F_k^2 for a suitable c.

    Var(X) = E(X^2) - E(X)^2 <= E(X^2)

    E(X^2) = m \sum_{j=1}^n [ 1^{2k} + (2^k - 1^k)^2 + ... + (m_j^k - (m_j-1)^k)^2 ]

When a > b > 0, we have

    a^k - b^k = (a - b) \sum_{j=0}^{k-1} a^j b^{k-1-j} <= (a - b) k a^{k-1}

Therefore,

    E(X^2) <= m \sum_{j=1}^n [ k 1^{2k-1} + k 2^{k-1} (2^k - 1^k) + ... + k m_j^{k-1} (m_j^k - (m_j-1)^k) ]
           <= m \sum_{j=1}^n k m_j^{2k-1}
            = k F_1 F_{2k-1}

Exercise: We can show that for all positive integers m_1, m_2, ..., m_n,

    ( \sum_i m_i ) ( \sum_i m_i^{2k-1} ) <= n^{1-1/k} ( \sum_i m_i^k )^2

Therefore, we get that Var(X) <= k n^{1-1/k} F_k^2. Hence, by using the median of means aggregation technique, we can estimate F_k within a relative error of ε with probability at least (1 - δ) using O(k n^{1-1/k} (1/ε^2) log(1/δ)) independent estimators (each of which takes O(log n + log m) space).
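The median of means aggregation itself is generic and can be sketched in a few lines. This is an illustrative version (function name ours): `sample` stands for any procedure producing one independent copy of X, here demonstrated on a heavy-tailed distribution with known mean.

```python
import random
from statistics import median

def median_of_means(sample, s1, s2):
    """Average s1 = Theta(c/eps^2) copies to shrink the variance,
    then take the median of s2 = Theta(log(1/delta)) such averages
    to boost the success probability."""
    means = []
    for _ in range(s2):
        means.append(sum(sample() for _ in range(s1)) / s1)
    return median(means)

# Example: a Pareto(3) variable has true mean 3/2 but high variance;
# a single sample is a poor estimate, the boosted estimator is close.
rng = random.Random(42)
sample = lambda: rng.paretovariate(3)
print(median_of_means(sample, s1=800, s2=9))
```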
3 A Simpler Sketch for F_2

Using the above analysis, we can estimate F_2 using O(\sqrt{n} (1/ε^2) (log n + log m) log(1/δ)) bits. However, we can estimate F_2 using a much smaller number of bits, as follows. Suppose we have n independent uniform random variables x_1, x_2, ..., x_n, each taking values in {-1, +1}. (This requires n bits of memory, but we will show how to do this with O(log n) bits in the next section.) We compute a sketch as follows:

- Compute r = \sum_{i=1}^n x_i m_i.
- Return r^2 as an estimate for F_2.

Note that r can be maintained as new elements are seen in the stream by increasing/decreasing r by 1 depending on the sign of x_i. Why does this work?

    E(r^2) = E( (\sum_i x_i m_i)^2 ) = \sum_i m_i^2 E(x_i^2) + 2 \sum_{i<j} E(x_i x_j) m_i m_j = \sum_i m_i^2 = F_2

since, for i ≠ j, x_i and x_j are independent and so E(x_i x_j) = 0. For the variance, Var(r^2) = E(r^4) - F_2^2, and

    E(r^4) = E( (\sum_i x_i m_i)^2 (\sum_i x_i m_i)^2 )
           = E( (\sum_i x_i^2 m_i^2 + 2 \sum_{i<j} x_i x_j m_i m_j)^2 )
           = E( (\sum_i x_i^2 m_i^2)^2 ) + 4 E( (\sum_{i<j} x_i x_j m_i m_j)^2 )
             + 4 E( (\sum_i x_i^2 m_i^2) (\sum_{i<j} x_i x_j m_i m_j) )

The last term is 0 since every pair of variables x_i and x_j is independent. Since x_i^2 = 1, the first term is F_2^2. Therefore,

    Var(r^2) = E(r^4) - F_2^2
             = 4 E( \sum_{i<j} x_i^2 x_j^2 m_i^2 m_j^2 ) + 8 E( \sum_{(i<j) ≠ (k<l)} x_i x_j x_k x_l m_i m_j m_k m_l )

Again, the last term is 0: each such cross term involves at least three distinct variables, and every set of 4 random variables is mutually independent. Therefore,

    Var(r^2) = 4 \sum_{i<j} m_i^2 m_j^2 <= 2 F_2^2

Therefore, by using the median of means method, we can estimate F_2 using Θ((1/ε^2) log(1/δ)) independent estimates. However, the technique we presented needs O(n) random bits. We will reduce this to O(log n) bits in the next section by using 4-wise independent random variables rather than fully independent random variables.
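A minimal sketch of this ±1 estimator (class name ours) looks as follows. For simplicity it draws a fully independent random sign per domain value on demand, i.e., it still uses O(n) random bits; the next section's 4-wise independent construction removes that cost.

```python
import random

class TugOfWarSketch:
    """The +/-1 sketch for F_2: maintain r = sum_i x_i m_i, return r^2."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.signs = {}   # lazily drawn x_i in {-1, +1}, one per value
        self.r = 0

    def update(self, value):
        if value not in self.signs:
            self.signs[value] = self.rng.choice((-1, 1))
        self.r += self.signs[value]    # r changes by +/-1 per element

    def estimate(self):
        return self.r ** 2             # E(r^2) = F_2, Var(r^2) <= 2 F_2^2
```

Since the variance bound 2 F_2^2 holds with c = 2 independent of n, averaging Θ(1/ε^2) copies and taking medians suffices, as stated above.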
3.1 k-wise Independent Random Variables

In the previous analysis, note that we only needed the fact that every set of 4 distinct random variables x_i, x_j, x_k, x_l is mutually independent. We call a set of random variables X = {x_1, ..., x_n} k-wise independent if every subset of k random variables is independent. That is, for all 1 <= i_1 < i_2 < ... < i_k <= n,

    P( \bigwedge_{j=1}^k x_{i_j} = a_j ) = \prod_{j=1}^k P( x_{i_j} = a_j )

Example: Consider two fair coins x and y. Let z be a random variable that returns heads if x and y both land heads or both land tails (think XOR), and tails otherwise. We can easily check that any pair among x, y, and z is independent, but x, y, and z together are not independent.

In the above F_2 sketch, we only need the set of random variables X to be 4-wise independent. We can generate 2^n k-wise independent variables using O(n) bits via the random polynomial construction (and thus generate each F_2 estimate using O(log n) bits). The construction of 2-wise (or pairwise) independent random variables is shown below.

Consider a family of hash functions H = {h_{a,b} | a, b ∈ {0,1}^n}, where each h_{a,b} : {0,1}^n → {0,1}^n is defined as follows:

    h_{a,b}(x) = ax + b

with arithmetic over a field of size 2^n. That is, a hash function is constructed by choosing a and b uniformly at random from {0,1}^n, and all elements are hashed using this h_{a,b}. The values resulting from applying the random function H to the values in {0,1}^n are 2^n pairwise independent random variables.

Lemma 1. For all x, y: P(H(x) = y) = 1/2^n.

Proof: Exercise.

Lemma 2. For all x ≠ z and all y, w: P(H(x) = y ∧ H(z) = w) = 1/2^{2n}.

Proof sketch: Consider any hash function h_{a,b}. Given hash values y, w for x, z respectively, we can find a unique solution to the linear system of equations in a and b. Therefore, only one pair (a, b) out of the 2^{2n} possible pairs results in x, z hashing to y, w respectively. Therefore, the probability is 1/2^{2n}.

We can also easily see that the resulting variables H(x) are not 3-wise independent. For instance,

    P( H(1) = 2 ∧ H(2) = 3 ∧ H(3) = 100 ) = 0

This is because the first two hash values force a = 1, b = 1, and the third hash value is not possible using h_{1,1}.
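The pairwise independent family can be sketched in code. As an assumption for simplicity, this version works over the prime field Z_p with p = 2^31 - 1 rather than the field of size 2^n used in the notes; any finite field gives the same pairwise independence argument.

```python
import random

P = 2_147_483_647   # Mersenne prime 2^31 - 1; stand-in for a field of size 2^n

def random_pairwise_hash(rng):
    """Draw h_{a,b}(x) = (a*x + b) mod p uniformly from the family H."""
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P

rng = random.Random(1)
h = random_pairwise_hash(rng)
# Any two hash values are jointly uniform (Lemma 2), but a third is then
# determined: e.g., h(1) and h(2) fix a and b, hence also h(3).
print(h(1), h(2), h(3))
```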
The above construction can be extended to generate k-wise independent random variables by using random polynomials of the form

    h(x) = \sum_{i=0}^{k-1} a_i x^i

with the k coefficients a_0, ..., a_{k-1} chosen uniformly at random.
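Putting this together for the F_2 sketch: a random degree-3 polynomial gives 4-wise independent hash values while storing only 4 field elements, i.e., O(log n) bits instead of n. Again we use a prime modulus as a stand-in for the field of size 2^n, and derive a ±1 sign from the low bit of the hash value (nearly unbiased since p is large and odd); both are simplifying assumptions of this sketch, not part of the notes.

```python
import random

P = 2_147_483_647   # prime modulus; stand-in for a field of size 2^n

def random_kwise_hash(k, rng):
    """A degree-(k-1) random polynomial over Z_p: a k-wise independent
    family stored in just k field elements (O(k log p) bits)."""
    coeffs = [rng.randrange(P) for _ in range(k)]   # a_0, ..., a_{k-1}
    def h(x):
        v = 0
        for a in reversed(coeffs):   # Horner's rule
            v = (v * x + a) % P
        return v
    return h

# 4-wise independent +/-1 signs for the F_2 sketch:
rng = random.Random(7)
h = random_kwise_hash(4, rng)
sign = lambda x: 1 if h(x) % 2 == 0 else -1
print([sign(x) for x in range(8)])
```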