Estimating Frequency Moments of Streams

Similar documents
Luby s Alg. for Maximal Independent Sets using Pairwise Independence

1 Example 1: Axis-aligned rectangles

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

Recurrence. 1 Definitions and main statements

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

What is Candidate Sampling

Calculation of Sampling Weights

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Sketching Sampled Data Streams

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

We are now ready to answer the question: What are the possible cardinalities for finite fields?

A machine vision approach for detecting and inspecting circular parts

CHAPTER 14 MORE ABOUT REGRESSION

BERNSTEIN POLYNOMIALS

Lecture 3: Force of Interest, Real Interest Rate, Annuity

Support Vector Machines

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM

SIMPLE LINEAR CORRELATION

PERRON FROBENIUS THEOREM

This circuit than can be reduced to a planar circuit

Realistic Image Synthesis

Level Annuities with Payments Less Frequent than Each Interest Period

Simple Interest Loans (Section 5.1) :

8 Algorithm for Binary Searching in Trees

n + d + q = 24 and.05n +.1d +.25q = 2 { n + d + q = 24 (3) n + 2d + 5q = 40 (2)

Extending Probabilistic Dynamic Epistemic Logic

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University

The Greedy Method. Introduction. 0/1 Knapsack Problem

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

CHAPTER 5 RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES

Quantization Effects in Digital Filters

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

Fisher Markets and Convex Programs

Conversion between the vector and raster data structures using Fuzzy Geographical Entities

Calculating the high frequency transmission line parameters of power cables

J. Parallel Distrib. Comput.

DEFINING %COMPLETE IN MICROSOFT PROJECT

The OC Curve of Attribute Acceptance Plans

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

Chapter 2 The Basics of Pricing with GLMs

Finite Math Chapter 10: Study Guide and Solution to Problems

OPTIMAL INVESTMENT POLICIES FOR THE HORSE RACE MODEL. Thomas S. Ferguson and C. Zachary Gilstein UCLA and Bell Communications May 1985, revised 2004

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

Figure 1. Inventory Level vs. Time - EOQ Problem

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

where the coordinates are related to those in the old frame as follows.

GENETIC ALGORITHM FOR PROJECT SCHEDULING AND RESOURCE ALLOCATION UNDER UNCERTAINTY

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

Rotation Kinematics, Moment of Inertia, and Torque

1 De nitions and Censoring

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

Lecture 2 Sequence Alignment. Burr Settles IBS Summer Research Program 2008 bsettles@cs.wisc.edu

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

A Performance Analysis of View Maintenance Techniques for Data Warehouses

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

Solution: Let i = 10% and d = 5%. By definition, the respective forces of interest on funds A and B are. i 1 + it. S A (t) = d (1 dt) 2 1. = d 1 dt.

Consider a 1-D stationary state diffusion-type equation, which we will call the generalized diffusion equation from now on:

Project Networks With Mixed-Time Constraints

An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services

Sensor placement for leak detection and location in water distribution networks

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

Enabling P2P One-view Multi-party Video Conferencing

The Stochastic Guaranteed Service Model with Recourse for Multi-Echelon Warehouse Management

IDENTIFICATION AND CORRECTION OF A COMMON ERROR IN GENERAL ANNUITY CALCULATIONS

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Formulating & Solving Integer Problems Chapter

Power-of-Two Policies for Single- Warehouse Multi-Retailer Inventory Systems with Order Frequency Discounts

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS

A Probabilistic Theory of Coherence

EXAMPLE PROBLEMS SOLVED USING THE SHARP EL-733A CALCULATOR

An RFID Distance Bounding Protocol

Managing Resource and Servent Reputation in P2P Networks

2008/8. An integrated model for warehouse and inventory planning. Géraldine Strack and Yves Pochet

A Dynamic Load Balancing for Massive Multiplayer Online Game Server

Estimation of Dispersion Parameters in GLMs with and without Random Effects

Analysis of Energy-Conserving Access Protocols for Wireless Identification Networks

Section 5.4 Annuities, Present Value, and Amortization

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Automating Analysis of Large-Scale Botnet Probing Events

INSTITUT FÜR INFORMATIK

Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing

On fourth order simultaneously zero-finding method for multiple roots of complex polynomial equations 1

Usage of LCG/CLCG numbers for electronic gambling applications

NMT EE 589 & UNM ME 482/582 ROBOT ENGINEERING. Dr. Stephen Bruder NMT EE 589 & UNM ME 482/582

Adaptive Fractal Image Coding in the Frequency Domain

How To Calculate An Approxmaton Factor Of 1 1/E

A role based access in a hierarchical sensor network architecture to provide multilevel security

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem

Transcription:

Estmatng Frequency Moments of Streams In ths class we wll look at the two smple sketches for estmatng the frequency moments of a stream. The analyss wll ntroduce two mportant trcks n probablty boostng the accuracy of a random varable by consdeer the medan of means of multple ndependent copes of the random varable, and usng k-wse ndependent sets of random varable. Frequency Moments Consder a stream S = {a, a 2,..., a m } wth elements from a doman D = {v, v 2,..., v n }. Let m denote the frequency (also sometmes called multplcty of value v D;.e., the number of tmes v appears n S. The k th frequency moment of the stream s defned as: F k = m k ( We wll develop algorthms that can approxmate F k by makng one pass of the stream and usng a small amount of memory o(n + m. Frequency moments have a number of applcatons. F 0 represents the number of dstnct elements n the streams (whch the FM-sketch from last class estmates usng O(log n space. F s the number of elements n the stream m. F 2 s used n database optmzaton engnes to estmate self jon sze. Consder the query, return all pars of ndvduals that are n the same locaton. Such a query has cardnalty equal to m2 /2, where m s the number of ndvduals at a locaton. Dependng on the estmated sze of the query, the database can decde (wthout actually evaluatng the answer whch query answerng strategy s best suted. F 2 s also used to measure the nformaton n a stream. In general, F k represents the degree of skew n the data. If F k /F 0 s large, then there are some values n the doman that repeat more frequently than the rest. Estmatng the skew n the data also helps when decdng how to partton data n a dstrbuted system. 2 AMS Sketch Lets frst assume that we know m. Construct a random varable X as follows: = Choose a random element from the stream x = a. Let r = {a j j, a j = a }, or the number of tmes the value x appears n the rest of the stream (nclusve of a. X = m(r k (r k X can be constructng usng O(log n + log m space log n bts to store the value x, and log m bts to mantan r. Exercse: We assumed that we know the number of elements n the stream. However the above can be modfed to work even when m s unknown. (Hnt: reservor samplng. It s easy to see that X s an unbased estmator of F k.

E(X = m = = m = m m = j= m E(X th element n the stream was pcked m E(X a s the k th repetton of v j j= k= j= m k j k + (2 k k +... + (m k j (m j k = F k We now show how to use multple such random varables X to estmate F k wthn ɛ relatve error wth hgh probablty ( δ. 2. Medan of Means Suppose X s a random varable such that E(X = µ and V ar(x < cµ 2, for some c > 0. Then, we can construct an estmator Z such that for all ɛ > 0 and δ > 0, E(Z = E(X = µ and P ( Z µ > ɛµ < δ (2 by averagng s = Θ(c/ɛ 2 ndependent copes of X, and then takng the medan of s 2 = Θ(log(/δ such averages. Means: Let X,..., X s be s copes of X. Let Y = s X. Clearly, E(Y = E(X = µ. Therefore, f s = 8c ɛ 2, then P ( Y µ > ɛµ < 8. V ar(y = V ar(x < cµ2 s s P ( Y µ > ɛµ < V ar(y ɛ 2 µ 2 by Chebyshev Medan of means: Now let Z be the medan of s 2 copes of Y. Let W be defned as follows: { f Y µ > ɛµ W = 0 else From the prevous result about Y, E(W = ρ < 8. Therefore, E( W < s 2 /8. Moreover, 2

whenever the medan Z s outsde the nterval µ ± ɛ, W > s 2 /2. Therefore, P ( Z µ > ɛµ < P ( W > s 2 /2 P ( W E( W > s 2 /2 s 2 ρ = P ( W E( W > ( 2ρ s 2ρ ( 2e 3 2 s2 2ρ ρ by Chernoff bounds < 2e s 2 3 when ρ < 8, ρ ( 2ρ 2 > Therefore, takng the medan of s 2 = 3 log ( 2 δ ensures that P ( Z µ > ɛµ < δ. 2.2 Back to AMS We use the medans of means approach to boost the accuracy of the AMS random varable X. For that, we need to bound the varance of X by c F 2 k. V ar(x = E(X 2 E(X 2 E(X 2 = When a > b > 0, we have m 2 m = 2k + (2 2k 2k +... + (m 2k (m 2k k a k b k = (a b( a j b k j (a b(ka k j=0 Therefore, E(X 2 m k 2k + (k2 k (2 k k +... + km k (m k (m k m km 2k + km 2k 2 +... + km 2k n = kf F 2k Exercse: We can show that for all postve ntegers m, m 2,..., m n, ( ( m m 2k n k ( Therefore, we get that V ar(x kn k Fk 2. Hence, by usng the medan of means aggregaton technque, we can estmate F k wthn a relatve error of ɛ wth probablty at least ( δ usng O(kn k log ( ɛ 2 δ ndependent estmators (each of whch take O(log n + log m space. m k 2 3

3 A smpler sketch for F 2 Usng the above analyss we can estmate F 2 usng O( n (log n + log m log ( ɛ 2 δ bts. However, we can estmate F 2 usng much smaller number of bts as follows. Suppose we have n ndependent unform random varables x, x 2,..., x n each takng values n {, }. (Ths requres n bts of memory, but we wll show how to do ths n O(log n bts n the next secton. We compute a sketch as follows: Compute r = n = x m Return r 2 as an estmate for F 2. Note that r can be mantaned as the new elements are seen n the stream by ncreasng/decreasng r by dependng on the sgn of x. Why does ths work? E(r 2 = E( x m 2 = m 2 Ex 2 + 2 <j Ex x j m m j = m 2 = F 2 snce x, x j are ndependent, E(x x j s 0 V ar(r 2 = E(r 4 F2 2 E(r 4 = E ( x m 2 ( x m 2 = E(( x 2 m 2 + (2 <j x x j m m j 2 = E( x 2 m 2 2 + 4E( <j x x j m m j 2 + 4E( x 2 m 2 ( <j x x j m m j The last term s 0 snce every par of varables x and x j are ndependent. Snce x 2 term s F2 2. =, the frst V ar(r 2 = E(r 4 F 2 2 = 4E( <j = 4E <j x 2 x 2 jm 2 m 2 j + 4E x x j m m j 2 <j<k<l x x j x k x l m m j m k m l Agan, the last term s 0 snce every set of 4 random varables s ndependent of each other. Therefore, V ar(r 2 = 4 <j m 2 m 2 j 2F 2 2 Therefore, by usng the medan of means method, we can estmate F 2 usng Θ( ɛ 2 log ( δ ndependent estmates. However, the technque we presented needs O(n random bts. We wll reduce ths to O(log n bts n the next secton by usng 4-wse ndependent random varables rather than fully ndependent random varables. 4

3. k-wse Independent Random Varables In the prevous analyss, note that we only needed to use the fact that eveyr set of 4 dstnct random varables x, x j, x k, x l are ndependent of each. We call a set of random varables X = {x,..., x n } to be k-wse ndependent random varables f every subset of k random varables are ndependent. That s: k < 2 <... < k n, P ( k j=x j = a j = P (x j = a j Example: Consder two far cons x and y. Let z be a random varable that returns heads f x and y both lands heads or both land tals (thnk XOR, and tals otherwse. We can easly check that any par of x, y and z are ndependent, but all x, y and z are not ndependent. In the above F 2 sketch, we only need the set of random varables X to be 4-wse ndependent. We can generate 2 n k-wse ndependent varables usng O(n bts usng the random polynomal constructon (and thus generate each F 2 estmate usng O(log n bts. The constructon of 2-wse (or parwse ndependent random varables s shown below. Consder a famly of hash functons H = {h a,b a, b {0, } n }, where each h a,b : {0, } n {0, } n s defned as follows: h a,b (x = ax + b That s a hash functon s constructed by choosng a and b unformly at random from {0, } n. All elements are hashed usng h a,b. The values resultng from applyng H to values n {0, } n are 2 n parwse ndependent random varables. j= Lemma. x, y, P (H(x = y = 2 n Proof: Exercse Lemma 2. x, y, z, w P (H(x = y H(z = w = 2 2n Proof sketch: Consder any hash functon h a,b. Gven hash values y, w for x, z respectvely, we can fnd a unque soluton for the lnear system of equatons nvolvng a, b. Therefore, only one par out of the 2 2n pars wll result n x, z hashng to y, w respectvely. Therefore, the probablty s 2 2n. We can also easly see that the resultng varables H(x are not 3-wse ndependent. For nstance, P (H( = 2 H(2 = 3 H(3 = 00 = 0 Ths s because the frst two has h value force a =, b =, and the thrd hash value s not possble usng h,. The above constructon can be extended to generate k-wse ndependent random varables by usng random polynomals of the form k =0 a x. 5