Space-Efficient Estimation of Statistics over Sub-Sampled Streams




Andrew McGregor, A. Pavan, Srikanta Tirthapura, David Woodruff

Andrew McGregor: University of Massachusetts. E-mail: mcgregor@cs.umass.edu. Supported by NSF CAREER Award CCF-0953754.
A. Pavan: Iowa State University. E-mail: pavan@cs.iastate.edu. Supported in part by NSF CCF-0916797.
Srikanta Tirthapura: Iowa State University. E-mail: snt@iastate.edu. Supported in part by NSF CNS-0834743, CNS-0831903.
David P. Woodruff: IBM Almaden. E-mail: dpwoodru@us.ibm.com

Abstract

In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to subsample the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, in many cases, the estimation of aggregates on the original stream cannot be accomplished through simply estimating them on the sampled stream, followed by a normalization. We present algorithms for estimating frequency moments, support size, entropy, and heavy hitters of the original stream, through a single pass over the sampled stream.

Keywords: data streams, frequency moments, sub-sampling

1 Introduction

In many stream monitoring situations, the data arrival rate is so high that it is not possible to observe each element in the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties of the original stream. For example, in an IP router, aggregated statistics of the packet stream are maintained through a protocol such as Netflow [9]. In high-end routers, the load due to statistics maintenance can be so high that a variant of Netflow called sampled Netflow has been developed. In randomly sampled Netflow, the monitor gets to view only a random sample of the packet stream, and must maintain statistics on the original stream using this view.

In such scenarios of extreme data deluge, we are faced with two constraints on data processing. First, the entire data set is not seen by the monitor; only a random sample is seen. Second, even the random sample of the input is too large to be stored in main memory (or in secondary memory), and must be processed in a single pass through the data, as in the usual data stream model.

While there has been a large body of work that has dealt with data processing using a random sample (see for example, [3, 4]), and extensive work on the one-pass data stream model (see for example, [1, 9, 33]), there has been little work so far on data processing in the presence of both constraints, where only a random sample of the data set is available and it must be processed in a streaming fashion. We note that the estimation of frequency moments over a sampled stream is one of the open problems from [31], posed as Question 13, "Effects of Subsampling".

1.1 Problem Setting

We assume the setting of Bernoulli sampling, described as follows. Consider an input stream $P = \langle a_1, a_2, \ldots, a_n \rangle$ where $a_i \in \{1, 2, \ldots, m\}$. For a parameter $p$, $0 < p \le 1$, a sub-stream of $P$, denoted $L$, is constructed as follows. For $1 \le i \le n$, $a_i$ is included in $L$ with probability $p$. The stream processor is only allowed to see $L$, and cannot see $P$. The goal is to estimate properties of $P$ through processing stream $L$. In the following discussion, $L$ is called the sampled stream, and $P$ is called the original stream.

1.2 Our Results

We present algorithms and lower bounds for estimating key aggregates of a data stream by processing a randomly sampled substream. We consider the basic frequency related aggregates, including the number of distinct elements, the frequency moments, the empirical entropy of the frequency distribution, and the heavy hitters.

1. Frequency Moments: For the frequency moments $F_k$ for $k \ge 2$, we present $(1+\varepsilon, \delta)$-approximation algorithms with space complexity $\tilde{O}(p^{-1} m^{1-2/k})$ (see the note on $\tilde{O}$ notation after this list). This result yields an interesting tradeoff between the sampling probability and the space used by the algorithm. The smaller the sampling probability (up to a certain minimum probability), the greater is the streaming space complexity of our algorithm. The algorithm is presented in Section 3.

2. Distinct Elements: For the number of distinct elements, $F_0$, we show that the current best offline methods for estimating $F_0$ from a random sample can be implemented in a streaming fashion using very small space. While it is known that random sampling can significantly reduce the accuracy of an estimate for $F_0$ [7], we show that the need to process this stream using small space does not. The upper and lower bounds are presented in Section 4.

3. Entropy: For estimating entropy we first show that no multiplicative approximation is possible in general, even when $p$ is constant. However, we show that estimating the empirical entropy on the sampled stream yields a constant factor approximation to the entropy of the original stream if the entropy is larger than some vanishingly small function of $p$ and $n$. These results are presented in Section 5.

4. Heavy Hitters: We show tight bounds for identifying a set of $O(1/\alpha)$ elements whose frequency exceeds $\alpha F_k^{1/k}$ for $k \in \{1, 2\}$. In the case of $k = 1$, we show that existing heavy hitter algorithms can be used if the stream is sufficiently long compared with $p$. In the case of $k = 2$, we show how to adapt ideas used in Section 3 to arrive at an algorithm that uses space $\tilde{O}(1/p)$.

Note on notation: the $\tilde{O}$ notation suppresses factors polynomial in $1/\varepsilon$ and $1/\delta$ and factors logarithmic in $m$ and $n$.

Another way of interpreting our results is in terms of time-space tradeoffs for data stream problems. Almost every streaming algorithm has a time complexity of at least $n$, since the algorithm reads and processes each stream update. We show that for estimating $F_k$ (and other problems) it is unnecessary to process each update; instead, it suffices for the algorithm to read each item independently with probability $p$, and maintain a data structure of size $\tilde{O}(p^{-1} m^{1-2/k})$. Interestingly, the time to update the data structure per sampled stream item is still only $\tilde{O}(1)$. The time to output an estimate at the end of observation is $\tilde{O}(p^{-1} m^{1-2/k})$, i.e., roughly linear in the size of the data structure. As an example of the type of tradeoffs that are achievable, for estimating $F_2$ when $n = \Theta(m)$ we can set $p = \Theta(1/\sqrt{n})$ and obtain an algorithm using $\tilde{O}(\sqrt{n})$ total processing time and $\tilde{O}(\sqrt{n})$ workspace.

1.3 Related Work

There is a large body of prior work at the intersection of random sampling and data stream processing. Some of this work is along the lines of methods for random sampling from a data stream, including the reservoir sampling algorithm, attributed to Waterman (also see [37]). There has been much follow up on variants and generalizations of reservoir sampling, see for example [,16,0,30,36]. While this line of work focuses on how to efficiently sample from a stream, our work focuses on how to process a stream that has already been sampled.

Stream sampling is a well-researched method for managing the load on network monitors, while enabling accurate measurement. Packets are grouped into flows based on the values of certain attributes within the packet header.
One commonly used sampling method is the sampled netflow model (NF) [3], which is the same as the Bernoulli sampling that we consider here, where packets are sampled independently of each other. Other methods of sampling are also considered under the general umbrella of sampled netflow, such as deterministic sampling (one out of every $n$ packets). Another sampling method is the sample-and-hold model (SH) [], where, once a packet is sampled from a flow, all other packets belonging to that flow are also sampled. The priority sampling procedure [19] is a method for sampling from a weighted stream so that we can get unbiased estimators of individual weights with small variance. Szegedy [35] has shown that the priority sampling method of [19] essentially gets the smallest possible variance, given a fixed sample size.

In addition, various combinations and enhancements to these sampling mechanisms have been proposed [10 1, 1]. In particular, [1] presents methods for better tuning sampling parameters and for exporting partial summaries to slower storage, [1] presents methods that dynamically adapt the sampling rate to achieve a desired level of accuracy, [10] presents structure-aware sampling methods that provide improved accuracy when compared with NF on specific range queries of interest, and [11] presents stream sampling schemes for variance-optimal estimation of the total weight of an arbitrary subset of the stream of a certain size. There is much other work along the lines of optimizing sampling methods for accurate estimation of a specific class of aggregates on the original stream. Typical aggregates of interest include the distribution of the number of packets in different flows, and aggregates over sub-populations of all flows.

The above line of work tailors the sampling scheme towards specific goals, while we consider a simple but general sampling scheme, Bernoulli sampling, and explore how to efficiently process data under this sampling strategy. In many situations, including with sampled netflow, the sampling strategy is already decided by an external entity, such as the router, over which we may not have control.

Duffield et al. [17] consider the estimation of the sizes of IP flows and the number of IP flows in a packet stream through observing the sampled stream. In a follow up work [18], they provide methods for estimating the distribution of the sizes of the input flows by observing samples of the original stream; this can be viewed as constructing an approximate histogram. The techniques used there are maximum likelihood estimation, as well as protocol level detail at the IP and TCP level. Other work along these lines includes the work on inverting sampled traffic [6], which aims to recover the distribution of the original traffic through analyzing the sample, and work in [5, 13], which seeks to answer top-k queries and rank flows through analyzing the sample. While this line of work deals with inference from a random sample in detail, it does not consider the issue of processing the sample in a streaming manner using limited space, as we do here. Further, we consider aggregates such as frequency moments and entropy, which do not seem to have been investigated in detail on sampled streams in prior work on network monitoring. In particular, even when the space complexity of an algorithm is high, we present space lower bounds that help understand the extent to which these aggregates can be estimated.

Rusu and Dobra [34] consider the estimation of the second frequency moment of a stream (equivalently, the size of the self-join) through processing the sampled stream. Our work differs from theirs in the following ways. While [34] do not explicitly mention the space bound of their algorithm, we derived a $(1+\varepsilon, \delta)$ estimator for $F_2$ based on their algorithm and found that the estimator took $\tilde{O}(1/p^2)$ space.
We improve the dependence on the sampling probability and obtain an algorithm that only requires $\tilde{O}(1/p)$ space. This dependence on the sampling probability $p$ is optimal. Our technique is also different from theirs. Ours relies on counting the number of collisions in the sampled stream, while theirs relies on scaling an estimate of the second frequency moment of the sampled stream. We also consider higher frequency moments $F_k$, for $k > 2$, as well as the entropy, while they do not.

Bhattacharya et al. [6] consider stream processing in the model where the stream processor can adaptively skip past stream elements, and look at only a fraction of the input stream, thus speeding up stream computation. In their model, the stream processor has the power to decide which elements to see and which to skip past, hence it is "adaptive"; in our model, the stream processor does not have such power, and must deal with the randomly sampled stream that is presented to it. Our model reflects the setup in current network monitoring equipment, such as Randomly Sampled Netflow [9]. They present a constant factor approximation for $F_2$, while we present $(1+\varepsilon, \delta)$ approximations for all frequency moments $F_k$ for $k \ge 2$.

Bar-Yossef [3] presents lower bounds on the sampling probability, or equivalently, the number of samples needed to estimate certain properties of a data set, including the frequency moments. This yields a minimum sampling probability for the Bernoulli sampler that we consider, below which it is not possible to estimate aggregates accurately, whether streaming or otherwise. This is relevant to Theorem 1 in our paper, which assumes that the sampling probability must be at least a certain value.

There is work on probabilistic data streams [14,8], where the data stream itself consists of "probabilistic" data, and each element of the stream is a probability distribution over a set of possible events. Unlike in our model, the stream processor gets to see the entire input in the probabilistic streams model.

Remark. The preliminary conference version of this paper claimed matching lower bounds for estimating $F_k$ and heavy hitters [3]. The claimed lower bounds crucially depend on lower bounds obtained in an earlier work of Guha and Huang [4]. However, a problem has been found with the bounds of [4]. Thus the lower bound proofs that were presented in [3] do not hold.

2 Notation and Preliminaries

Throughout this paper, we will denote the original length-$n$ stream by $P = \langle a_1, a_2, \ldots, a_n \rangle$ and will assume that each element $a_i \in \{1, 2, \ldots, m\}$. We denote the sampling probability by $p$. The sampled stream $L$ is constructed by including each $a_i$ in $L$ with probability $p$, independently of the other elements. It is assumed that the sampling probability $p$ is fixed in advance and is known to the algorithm.

Throughout, let $f_i$ be the frequency of item $i$ in the original stream $P$. Let $g_i$ be the frequency in the sub-sampled stream and note that $g_i \sim \mathrm{Bin}(f_i, p)$. The streams $P$ and $L$ define frequency vectors $f = (f_1, f_2, \ldots, f_m)$ and $g = (g_1, g_2, \ldots, g_m)$ respectively. When considering a function $F$ on a stream (e.g., a frequency moment or the entropy) we will write $F(P)$ and $F(L)$ for the value of the function on the original and sampled stream respectively. When the context is clear, we will also abuse notation and use $F$ to indicate $F(P)$.

We are primarily interested in randomized multiplicative approximations.

Definition 1 For $\alpha > 1$ and $\delta \in [0,1]$, we say $\tilde{X}$ is an $(\alpha, \delta)$-estimator for $X$ if $\Pr[\alpha^{-1} \le X/\tilde{X} \le \alpha] \ge 1 - \delta$.

We use the notation $\tilde{O}$ to suppress factors polynomial in $1/\varepsilon$, $1/\delta$ and logarithmic in $n$. More precisely, given two functions $f$ and $g$ and constants $\varepsilon > 0$ and $\delta > 0$, we write $f \in \tilde{O}(g)$ to denote $f \in O(\mathrm{poly}(1/\varepsilon, 1/\delta, \log n)\, g)$. Similarly we write $f \in \tilde{\Omega}(g)$ to denote $f \in \Omega(\mathrm{poly}(1/\varepsilon, 1/\delta, \log n)\, g)$.

3 Frequency Moments

In this section, we present an algorithm for estimating the $k$th frequency moment $F_k$. The main theorem of this section is as follows.
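Theorem 1 For $k \ge 2$, there is a one pass streaming algorithm which observes $L$ and outputs a $(1+\varepsilon, \delta)$-estimator for $F_k(P)$ using $\tilde{O}(p^{-1} m^{1-2/k})$ space, assuming $p = \tilde{\Omega}(\min(m,n)^{-1/k})$.

For $p = \tilde{o}(\min(m,n)^{-1/k})$ there is not enough information in the sampled stream to obtain a $(1+\varepsilon, \delta)$ approximation to $F_k(P)$ with any amount of space; see Theorem 4.33 of [3].

Definition 2 For $1 \le \ell \le k$ define the number of $\ell$-wise collisions to be

$$C_\ell(P) = \sum_{i=1}^{m} \binom{f_i}{\ell} \quad\text{and}\quad C_\ell(L) = \sum_{i=1}^{m} \binom{g_i}{\ell}.$$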
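To make Definition 2 and the Bernoulli sampling model concrete, here is a minimal Python sketch. The function names (`bernoulli_sample`, `collisions`, `frequency_moment`) and the toy stream are illustrative assumptions, not part of the paper, and the code stores full frequency counts, so it is not the small-space algorithm of this section.

```python
import random
from collections import Counter
from math import comb


def bernoulli_sample(stream, p, rng):
    """Keep each element of `stream` independently with probability p.

    `rng` is any object with a random() method returning a float in [0, 1).
    """
    return [a for a in stream if rng.random() < p]


def collisions(stream, ell):
    """Number of ell-wise collisions: sum over items of binom(frequency, ell)."""
    return sum(comb(c, ell) for c in Counter(stream).values())


def frequency_moment(stream, k):
    """k-th frequency moment: sum over items of frequency**k."""
    return sum(c ** k for c in Counter(stream).values())


if __name__ == "__main__":
    # Hypothetical toy stream with frequencies f = (3, 2, 1).
    P = [1, 1, 1, 2, 2, 3]
    L = bernoulli_sample(P, p=0.5, rng=random.Random(0))
    print("C_2(P) =", collisions(P, 2), " C_2(L) =", collisions(L, 2))
```

Because every collision in the sampled stream is also a collision in the original stream, `collisions(L, ell)` can never exceed `collisions(P, ell)`.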

Our algorithm is based on the following connection between the $\ell$th frequency moment of a stream and the $\ell$-wise collisions in the stream.

Lemma 1 For $1 \le \ell \le k$,

$$F_\ell(P) = \ell!\, C_\ell(P) + \sum_{l=1}^{\ell-1} \beta^\ell_l F_l(P) \qquad (1)$$

where

$$\beta^\ell_l = (-1)^{\ell-l+1} \sum_{1 \le j_1 < \cdots < j_{\ell-l} \le \ell-1} j_1 j_2 \cdots j_{\ell-l}.$$

Proof The relationship follows from

$$\ell!\, C_\ell(P) = \sum_{i=1}^{m} f_i (f_i - 1) \cdots (f_i - \ell + 1) = \sum_{i=1}^{m} \Big( f_i^{\ell} - f_i^{\ell-1} \sum_{1 \le j_1 \le \ell-1} j_1 + f_i^{\ell-2} \sum_{1 \le j_1 < j_2 \le \ell-1} j_1 j_2 - \cdots \Big) = F_\ell(P) - \sum_{l=1}^{\ell-1} \beta^\ell_l F_l(P).$$

The following lemma relates the expectation of $C_\ell(L)$ to $C_\ell(P)$ and bounds the variance.

Lemma 2 For $1 \le \ell \le k$,

$$E[C_\ell(L)] = p^\ell C_\ell(P) \quad\text{and}\quad V[C_\ell(L)] = O\big(p^{2\ell-1} F_\ell^{2-1/\ell}\big).$$

Proof Let $C$ denote $C_\ell(L)$. Since each $\ell$-wise collision in $P$ appears in $L$ with probability $p^\ell$, we have $E[C] = p^\ell C_\ell(P)$. For each $i \in [m]$, let $C_i$ be the number of $\ell$-wise collisions in $L$ among items that equal $i$. Then $C = \sum_{i \in [m]} C_i$. By independence of the $C_i$,

$$V[C] = \sum_{i \in [m]} V[C_i].$$

Fix an $i \in [m]$. Let $S_i$ be the set of indices in the original stream equal to $i$. For each $J \subseteq S_i$ with $|J| = \ell$, let $X_J$ be the indicator random variable for the event that every stream element in $J$ appears in the sampled stream. Then $C_i = \sum_J X_J$. Hence, grouping pairs $(J, J')$ by the size $j = |J \cap J'|$ of their overlap (disjoint pairs are independent and contribute zero),

$$V[C_i] = \sum_{J,J'} \big( E[X_J X_{J'}] - E[X_J] E[X_{J'}] \big) = \sum_{J,J'} \big( p^{|J \cup J'|} - p^{2\ell} \big) = \sum_{j=1}^{\ell} \binom{f_i}{2\ell-j} \binom{2\ell-j}{\ell} \binom{\ell}{j} \big( p^{2\ell-j} - p^{2\ell} \big) = \sum_{j=1}^{\ell} O\big( f_i^{2\ell-j} p^{2\ell-j} \big).$$

Since $F_{2\ell-j}^{1/(2\ell-j)} \le F_\ell^{1/\ell}$ for all $j = 1, \ldots, \ell$, we have

$$V[C] = O(1) \sum_{j=1}^{\ell} F_{2\ell-j}\, p^{2\ell-j} = O(1) \sum_{j=1}^{\ell} F_\ell^{(2\ell-j)/\ell}\, p^{2\ell-j}.$$

If we can show that the first term of this sum dominates, the desired variance bound follows. This is the case if $p F_\ell^{1/\ell} \ge 1$, since this is the ratio of two consecutive summands. Note that $F_\ell$ is minimized for a fixed $F_0$ and $F_1$ when there are $F_0$ frequencies each of value $F_1/F_0$. In this case,

$$F_\ell^{1/\ell} = \big( F_0 (F_1/F_0)^\ell \big)^{1/\ell} = F_1 / F_0^{1-1/\ell}.$$

Hence, $p \ge 1/F_\ell^{1/\ell}$ if $p \ge F_0^{1-1/\ell}/F_1$, which holds by assumption.

We next describe the intuition behind our algorithm. To estimate $F_k(P)$, by Eq. 1, it suffices to obtain estimates for $F_1(P), F_2(P), \ldots, F_{k-1}(P)$ and $C_k(P)$ (one of the caveats is that some of the coefficients of $F_i(P)$ are negative, which we handle as explained below). Our algorithm attempts to estimate $F_\ell(P)$ for $\ell = 1, 2, \ldots$ inductively. Since, by Chernoff bounds, $F_1(P)$ is very close to $F_1(L)/p$, $F_1(P)$ can be estimated easily. Thus our problem reduces to estimating $C_k(P)$ by observing the sub-sampled stream $L$. Since the expected number of collisions in $L$ equals $p^k C_k(P)$, our algorithm will attempt to estimate $C_k(L)$, the number of $k$-wise collisions in the sub-sampled stream.

It is not possible to find a good relative approximation of $C_k(L)$ in small space if $C_k(L)$ is small. However, when $C_k(L)$ is small, it does not contribute significantly to the final answer and we do not need a good relative error approximation. We only need that our estimator does not grossly overestimate $C_k(L)$. Our algorithm to estimate $C_k(L)$ will have the following property: if $C_k(L)$ is large, then it outputs a good relative error approximation, and if $C_k(L)$ is small then it outputs a value that is at most $3C_k(L)$.

Another caveat is that some of the $\beta^\ell_i$'s could be negative. Thus a priori it is not clear that our strategy of estimating $F_k(P)$ by estimating $F_1(P), F_2(P), \ldots, F_{k-1}(P), C_k(P)$, and applying Equation 1 works.
However, by using a careful choice of approximation errors and the fact that $F_i(P) \ge F_j(P)$ when $i > j$, we argue that this approach succeeds in obtaining a good approximation of $F_k(P)$.

3.1 The Algorithm

Define a sequence of random variables $\phi_\ell$:

$$\phi_1 = \frac{F_1(L)}{p}, \quad\text{and}\quad \phi_\ell = \frac{C_\ell(L)\, \ell!}{p^\ell} + \sum_{i=1}^{\ell-1} \beta^\ell_i \phi_i \ \text{ for } \ell > 1.$$

Algorithm 1 inductively computes an estimate $\tilde{\phi}_i$ for each $\phi_i$. Note that if $C_\ell(L)/p^\ell$ takes its expected value of $C_\ell(P)$ and we could compute $C_\ell(L)$ exactly, then Eq. 1 implies that the algorithm would return $F_k(P)$ exactly. While this is excessively optimistic, we will show that $C_\ell(L)/p^\ell$ is sufficiently close to $C_\ell(P)$ with high probability and that we can construct an estimate $\tilde{C}_\ell(L)$ for $C_\ell(L)$ such that the final result returned is still a $(1+\varepsilon)$ approximation for $F_k(P)$ with probability at least $1 - \delta$.
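The recursion for $\phi_\ell$ can be sketched in Python as follows. This is a sketch under stated assumptions: it uses exact collision counts computed from a stored frequency vector in place of the small-space estimate, and the names `beta`, `collisions` and `phi_k` are illustrative rather than taken from the paper. With $p = 1$ the sampled stream equals the original stream, so by Eq. 1 the recursion should reproduce $F_k(P)$ exactly, which gives a quick sanity check of the coefficients $\beta^\ell_l$.

```python
from itertools import combinations
from math import comb, factorial, isclose, prod


def beta(ell, l):
    """beta^ell_l = (-1)^(ell-l+1) * sum of products of (ell-l) distinct
    integers drawn from {1, ..., ell-1}, as in Lemma 1."""
    e = sum(prod(c) for c in combinations(range(1, ell), ell - l))
    return (-1) ** (ell - l + 1) * e


def collisions(freqs, ell):
    """Exact number of ell-wise collisions for a frequency vector."""
    return sum(comb(f, ell) for f in freqs)


def phi_k(sampled_freqs, p, k):
    """Evaluate phi_1, ..., phi_k from the sampled frequencies g and return phi_k.

    Uses exact collision counts, so it needs space linear in the number of
    distinct items; it only illustrates the recursion.
    """
    phi = {1: sum(sampled_freqs) / p}
    for ell in range(2, k + 1):
        phi[ell] = (collisions(sampled_freqs, ell) * factorial(ell) / p ** ell
                    + sum(beta(ell, i) * phi[i] for i in range(1, ell)))
    return phi[k]


if __name__ == "__main__":
    f = [3, 2, 1]                      # hypothetical frequency vector
    exact_f3 = sum(x ** 3 for x in f)  # F_3 of this vector
    # With p = 1 the recursion collapses to Eq. 1 and recovers F_3.
    print(exact_f3, phi_k(f, 1.0, 3), isclose(phi_k(f, 1.0, 3), exact_f3))
```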

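Algorithm 1: $F_k(P)$
    Compute $F_1(L)$ exactly and set $\tilde{\phi}_1 = F_1(L)/p$.
    for $\ell = 2$ to $k$ do
        Let $\tilde{C}_\ell(L)$ be an estimate for $C_\ell(L)$, computed as described in the text.
        Compute $\tilde{\phi}_\ell = \tilde{C}_\ell(L)\, \ell! / p^\ell + \sum_{i=1}^{\ell-1} \beta^\ell_i \tilde{\phi}_i$
    end
    Return $\tilde{\phi}_k$.

We compute our estimate of $C_\ell(L)$ via an algorithm by Indyk and Woodruff [7]. This algorithm attempts to obtain a $(1 + \varepsilon_{\ell-1})$ approximation of $C_\ell(L)$ for some value of $\varepsilon_{\ell-1}$ to be determined. The estimator is as follows. For $i = 0, 1, 2, \ldots$ define

$$S_i = \{ j \in [m] : \eta (1+\varepsilon')^i \le g_j < \eta (1+\varepsilon')^{i+1} \}$$

where $\eta$ is randomly chosen between 0 and 1 and $\varepsilon' = \varepsilon_{\ell-1}/4$. The algorithm of Indyk and Woodruff [7] returns an estimate $\tilde{s}_i$ for $|S_i|$ and our estimate for $C_\ell(L)$ is defined as

$$\tilde{C}_\ell(L) := \sum_i \tilde{s}_i \binom{\eta (1+\varepsilon')^i}{\ell}.$$

The space used by the algorithm is $\tilde{O}(p^{-1} m^{1-2/\ell})$. We defer the details to Section 3.2.

We next define an event $\mathcal{E}$ that corresponds to our collision estimates being sufficiently accurate and the sampled stream being well-behaved. The next lemma establishes that $\Pr[\mathcal{E}] \ge 1 - \delta$. We will defer the proof until Section 3.2.

Lemma 3 Define the event $\mathcal{E} = \mathcal{E}_1 \cap \mathcal{E}_2 \cap \cdots \cap \mathcal{E}_k$ where

$$\mathcal{E}_1 : \tilde{\phi}_1 \in (1 \pm \varepsilon_1) F_1(P)$$
$$\mathcal{E}_\ell : \big| \tilde{C}_\ell(L)/p^\ell - C_\ell(P) \big| \le \varepsilon_{\ell-1} F_\ell(P)/\ell! \quad \text{for } \ell \ge 2,$$

where $\varepsilon_k = \varepsilon$, $\varepsilon_{\ell-1} = \varepsilon_\ell/(A_\ell + 1)$, and $A_\ell = \sum_{i=1}^{\ell-1} |\beta^\ell_i|$. Then $\Pr[\mathcal{E}] \ge 1 - \delta$.

The next lemma establishes that, conditioned on the event $\mathcal{E}$, the algorithm returns a $(1 \pm \varepsilon)$ approximation of $F_k(P)$ as required.

Lemma 4 Conditioned on $\mathcal{E}$, we have $\tilde{\phi}_\ell \in (1 \pm \varepsilon_\ell) F_\ell(P)$ for all $\ell \in [k]$.

Proof The proof is by induction on $\ell$. Since we are conditioning on event $\mathcal{E}$ and thus event $\mathcal{E}_1$, we have that $\tilde{\phi}_1$ is a $(1 \pm \varepsilon_1)$ approximation of $F_1(P)$. For the inductive step, the induction hypothesis ensures that $\tilde{\phi}_i$, $1 \le i \le \ell-1$, is a $(1 \pm \varepsilon_i)$ approximation of $F_i(P)$. Therefore,

$$\big| \tilde{\phi}_\ell - F_\ell(P) \big| = \Big| \frac{\tilde{C}_\ell(L)\, \ell!}{p^\ell} + \sum_{i=1}^{\ell-1} \beta^\ell_i \tilde{\phi}_i - F_\ell(P) \Big| \le \Big| \ell!\, C_\ell(P) + \sum_{i=1}^{\ell-1} \beta^\ell_i F_i(P) - F_\ell(P) \Big| + \varepsilon_{\ell-1} F_\ell(P) + \sum_{i=1}^{\ell-1} |\beta^\ell_i|\, \varepsilon_i F_i(P) = \varepsilon_{\ell-1} F_\ell(P) + \sum_{i=1}^{\ell-1} |\beta^\ell_i|\, \varepsilon_i F_i(P)$$

where the inequality follows since we are conditioning on event $\mathcal{E}_\ell$, which ensures that

$$\Big| \frac{\tilde{C}_\ell(L)\, \ell!}{p^\ell} - \ell!\, C_\ell(P) \Big| \le \varepsilon_{\ell-1} F_\ell(P),$$

and the induction hypothesis ensures that

$$\Big| \sum_{i=1}^{\ell-1} \beta^\ell_i \tilde{\phi}_i - \sum_{i=1}^{\ell-1} \beta^\ell_i F_i(P) \Big| \le \sum_{i=1}^{\ell-1} |\beta^\ell_i|\, \varepsilon_i F_i(P).$$

The final equality follows due to Equation 1. Note that $i \le j$ implies $\varepsilon_i \le \varepsilon_j$ and $F_i(P) \le F_j(P)$. Hence, by the definition of $\varepsilon_\ell$,

$$\varepsilon_{\ell-1} F_\ell(P) + \sum_{i=1}^{\ell-1} |\beta^\ell_i|\, \varepsilon_i F_i(P) \le \varepsilon_{\ell-1} F_\ell(P) \Big( 1 + \sum_{i=1}^{\ell-1} |\beta^\ell_i| \Big) = \varepsilon_\ell F_\ell(P).$$

Therefore $\tilde{\phi}_\ell \in (1 \pm \varepsilon_\ell) F_\ell(P)$ as required.

3.2 Proof of Lemma 3

Our goal is to show that $\Pr[\mathcal{E}_1 \cap \mathcal{E}_2 \cap \cdots \cap \mathcal{E}_k] \ge 1 - \delta$. To do this it will suffice to show that for each $\ell \in [k]$, $\Pr[\mathcal{E}_\ell] \ge 1 - \delta/k$ and appeal to the union bound.

We first observe that, by Chernoff bounds, the event $\mathcal{E}_1$ happens with probability at least $1 - \delta/k$. Let $X_i$ denote the 0-1 random variable whose value is 1 if the $i$th item of the original stream appears in the sampled stream. Note that $E[X_i] = p$ for $1 \le i \le n$, and $F_1(L) = \sum_{i=1}^{n} X_i$. Since $\tilde{\phi}_1 = F_1(L)/p$, we have $\tilde{\phi}_1 = \sum_{i=1}^{n} X_i / p$. Recall that $n = F_1(P)$. Then

$$\Pr[\neg \mathcal{E}_1] = \Pr\big[ |\tilde{\phi}_1 - F_1(P)| \ge F_1(P)\, \varepsilon_1 \big] = \Pr\Big[ \Big| \frac{\sum_i X_i}{p} - F_1(P) \Big| \ge F_1(P)\, \varepsilon_1 \Big] = \Pr\Big[ \Big| \sum_i X_i - F_1(P)\, p \Big| \ge p n \varepsilon_1 \Big] \le e^{-\varepsilon_1^2 F_1(P) p / 2} \ \text{(by the Chernoff bound)} \le \delta/k.$$

The last inequality follows because our condition on $p$ implies $p > \mathrm{poly}(1/\varepsilon) \log(1/\delta) / F_1(P)$.

To analyze $\Pr[\mathcal{E}_\ell]$ for $2 \le \ell \le k$ we consider the events:

$$\mathcal{E}^1_\ell : \big| C_\ell(L)/p^\ell - C_\ell(P) \big| \le \frac{\varepsilon_{\ell-1} F_\ell(P)}{2\, \ell!}$$
$$\mathcal{E}^2_\ell : \big| \tilde{C}_\ell(L)/p^\ell - C_\ell(L)/p^\ell \big| \le \frac{\varepsilon_{\ell-1} F_\ell(P)}{2\, \ell!}.$$

By the triangle inequality it is easy to see that $\Pr[\mathcal{E}^1_\ell \cap \mathcal{E}^2_\ell] \le \Pr[\mathcal{E}_\ell]$ and hence it suffices to show that $\Pr[\mathcal{E}^1_\ell] \ge 1 - \delta/(2k)$ and $\Pr[\mathcal{E}^2_\ell] \ge 1 - \delta/(2k)$. The first part follows easily from the variance bound in Lemma 2.

Lemma 5 $\Pr[\mathcal{E}^1_\ell] \ge 1 - \frac{\delta}{4k}$.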

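Proof There are two cases depending on the value of $E[C_\ell(L)]$.

Case I: First assume $E[C_\ell(L)] \le \frac{\delta \varepsilon_{\ell-1} p^\ell F_\ell}{8k\, \ell!}$. Therefore, by Lemma 2, we also know that

$$C_\ell(P) \le \frac{\delta \varepsilon_{\ell-1} F_\ell}{8k\, \ell!}. \qquad (2)$$

By Markov's bound,

$$\Pr\Big[ C_\ell(L) \le \frac{\varepsilon_{\ell-1} p^\ell F_\ell}{2\, \ell!} \Big] \ge 1 - \frac{\delta}{4k}. \qquad (3)$$

Eq. 2 and Eq. 3 together imply that with probability at least $1 - \frac{\delta}{4k}$,

$$\big| C_\ell(L)/p^\ell - C_\ell(P) \big| \le \max\big( C_\ell(L)/p^\ell,\ C_\ell(P) \big) \le \frac{\varepsilon_{\ell-1} F_\ell}{2\, \ell!}.$$

Case II: Next assume $E[C_\ell(L)] > \frac{\delta \varepsilon_{\ell-1} p^\ell F_\ell}{8k\, \ell!}$. By Chebyshev's bound, and using Lemma 2, we get:

$$\Pr\Big[ \big| C_\ell(L) - E[C_\ell(L)] \big| \ge \frac{\varepsilon_{\ell-1} E[C_\ell(L)]}{2} \Big] \le \frac{4 V[C_\ell(L)]}{\varepsilon_{\ell-1}^2 E[C_\ell(L)]^2} \le \frac{D (k\, \ell!)^2}{\delta^2 \varepsilon_{\ell-1}^4\, p F_\ell^{1/\ell}} \le \frac{D (k\, \ell!)^2 F_0^{1-1/\ell}}{\delta^2 \varepsilon_{\ell-1}^4\, p F_1} \le \frac{D (k\, \ell!)^2}{\delta^2 \varepsilon_{\ell-1}^4\, p \min(F_0^{1/\ell}, F_1^{1/\ell})} = \frac{D (k\, \ell!)^2}{\delta^2 H^4 \varepsilon^4\, p \min(F_0^{1/\ell}, F_1^{1/\ell})} \le \frac{\delta}{4k}$$

where $D$ and $H$ are sufficiently large constants. The third inequality follows because $F_\ell^{1/\ell} \ge F_1/F_0^{1-1/\ell}$. The equality follows because $\varepsilon = H \varepsilon_{\ell-1}$. The last inequality follows because our assumption on $p$ implies that $p \ge \mathrm{poly}(1/\varepsilon, 1/\delta) \min(F_0, F_1)^{-1/k}$. Since $E[C_\ell(L)] = p^\ell C_\ell(P)$ and $C_\ell(P) \le F_\ell(P)/\ell!$, we have that

$$\Pr\Big[ \big| C_\ell(L)/p^\ell - C_\ell(P) \big| \le \frac{\varepsilon_{\ell-1} F_\ell(P)}{2\, \ell!} \Big] \ge 1 - \frac{\delta}{4k}$$

as required.

We will now show that $\mathcal{E}^2_\ell$ happens with high probability by analyzing the algorithm that computes $\tilde{C}_\ell(L)$. We need the following result due to Indyk and Woodruff [7]. Recall that $\varepsilon' = \varepsilon_{\ell-1}/4$.

Theorem 2 (Indyk and Woodruff [7]) Let $G$ be the set of indices $i$ for which

$$|S_i| (1+\varepsilon')^{2i} \ge \frac{\gamma F_2(L)}{\mathrm{poly}(\varepsilon'^{-1} \log n)}; \qquad (4)$$

then

$$\Pr\big[ \forall i \in G,\ \tilde{s}_i \in (1 \pm \varepsilon') |S_i| \big] \ge 1 - \frac{\delta}{8k}.$$

For every $i$ (whether it is in $G$ or not) $\tilde{s}_i \le 3 |S_i|$. Moreover, the algorithm runs in space $\tilde{O}(1/\gamma)$.

We say that a set $S_i$ contributes if

$$(1+\varepsilon')^{i\ell}\, |S_i| > \frac{C_\ell(L)}{B},$$

where $B = \mathrm{poly}(\varepsilon'^{-1} \log n)$. Given $i$, the event that $S_i$ contributes holds with a certain (conceivably 0) probability. We first show that if $S_i$ contributes, then $S_i$ is a good set with high probability. More precisely, we show that for every $S_i$ that contributes, Eq. 4 holds with high probability with $\gamma = p\, m^{-1+2/\ell}$.

Lemma 6 Suppose that $C_\ell(L) > \frac{\varepsilon_{\ell-1} p^\ell F_\ell(P)}{4\, \ell!}$, and also suppose that the event "$S_i$ contributes" happened. Then

$$\Pr\Big[ |S_i| (1+\varepsilon')^{2i} \ge \frac{\delta\, p\, F_2(L)}{m^{1-2/\ell}\, \mathrm{poly}(\varepsilon^{-1} \log n)} \Big] \ge 1 - \frac{\delta}{8k}.$$

Proof Consider a set $S_i$ that contributes. Note that the probability that $\eta < 1/\mathrm{poly}(\delta^{-1} \varepsilon^{-1} \log n)$ is at most $1/\mathrm{poly}(\delta^{-1} \varepsilon^{-1} \log n)$. Without loss of generality we can take this probability to be less than $\delta/(16k)$. By our assumption on $C_\ell(L)$ and the fact that $S_i$ contributes,

$$|S_i| (1+\varepsilon')^{i\ell} \ge \frac{\varepsilon' p^\ell F_\ell(P)}{B\, \ell!}$$

holds with probability at least $1 - \delta/(8k)$. Thus

$$|S_i| (1+\varepsilon')^{2i} \ge \Big( \frac{\varepsilon'}{B\, \ell!} \Big)^{2/\ell} p^2 F_\ell^{2/\ell}(P) \ge \frac{p^2 F_2(P)}{m^{1-2/\ell}\, \mathrm{poly}(\varepsilon^{-1} \log n)}$$

where the second inequality is an application of Hölder's inequality. Note that $E[F_2(L)] = p^2 F_2(P) + p(1-p) F_1(P) \le p F_2(P)$. Thus, by an application of the Markov bound,

$$\Pr\Big[ F_2(L) \le \frac{16 k\, p\, F_2(P)}{\delta} \Big] \ge 1 - \frac{\delta}{16k}. \qquad (5)$$

The lemma follows as the following inequalities hold with probability at least $1 - \delta/(8k)$:

$$|S_i| (1+\varepsilon')^{2i} \ge \frac{p^2 F_2(P)}{m^{1-2/\ell}\, \mathrm{poly}(\varepsilon^{-1} \log n)} = \frac{\delta\, p \cdot \big( 16 k\, p\, F_2(P)/\delta \big)}{16 k\, m^{1-2/\ell}\, \mathrm{poly}(\varepsilon^{-1} \log n)} \ge \frac{\delta\, p\, F_2(L)}{m^{1-2/\ell}\, \mathrm{poly}(\varepsilon^{-1} \log n)} \quad \text{(by (5))},$$

where the factor $16k$ is absorbed into the polynomial.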
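The identity $E[F_2(L)] = p^2 F_2(P) + p(1-p) F_1(P) \le p F_2(P)$ used in the proof of Lemma 6 can be checked numerically. The sketch below is an illustration only: the function names and the example frequency vector are assumptions, and the expectation is computed exactly from the binomial distribution of each $g_i \sim \mathrm{Bin}(f_i, p)$ rather than by simulation.

```python
from math import comb, isclose


def expected_f2_sampled(freqs, p):
    """Exact E[F_2(L)]: sum over items of E[g^2] with g ~ Bin(f, p)."""
    total = 0.0
    for f in freqs:
        # E[g^2] from the binomial probability mass function.
        total += sum(j * j * comb(f, j) * p ** j * (1 - p) ** (f - j)
                     for j in range(f + 1))
    return total


def closed_form(freqs, p):
    """p^2 F_2(P) + p (1 - p) F_1(P)."""
    f1 = sum(freqs)
    f2 = sum(f * f for f in freqs)
    return p * p * f2 + p * (1 - p) * f1


if __name__ == "__main__":
    freqs, p = [5, 3, 1], 0.3          # hypothetical example values
    lhs = expected_f2_sampled(freqs, p)
    rhs = closed_form(freqs, p)
    upper = p * sum(f * f for f in freqs)
    print(lhs, rhs, upper, isclose(lhs, rhs), rhs <= upper)
```

The upper bound holds because every frequency is a positive integer, so $F_1(P) \le F_2(P)$.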

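Now we are ready to prove that the event $\mathcal{E}^2_\ell$ holds with high probability.

Lemma 7 $\Pr[\mathcal{E}^2_\ell] \ge 1 - \frac{\delta}{2k}$.

Proof There are two cases depending on the size of $C_\ell(L)$.

Case I: Assume $C_\ell(L) \le \frac{\varepsilon_{\ell-1} p^\ell F_\ell(P)}{4\, \ell!}$. By Theorem 2, it follows that $\tilde{C}_\ell(L) \le 3 C_\ell(L)$. Thus

$$\big| \tilde{C}_\ell(L) - C_\ell(L) \big| \le 2 C_\ell(L) \le \frac{\varepsilon_{\ell-1} p^\ell F_\ell(P)}{2\, \ell!}.$$

Case II: Assume $C_\ell(L) > \frac{\varepsilon_{\ell-1} p^\ell F_\ell}{4\, \ell!}$. By Lemma 6, for every $S_i$ that contributes,

$$\Pr\Big[ |S_i| (1+\varepsilon')^{2i} \ge \frac{\delta\, p\, F_2(L)}{m^{1-2/\ell}\, \mathrm{poly}(\varepsilon^{-1} \log n)} \Big] \ge 1 - \frac{\delta}{8k}.$$

Now by Theorem 2, for each $S_i$ that contributes, $\tilde{s}_i \in (1 \pm \varepsilon') |S_i|$ with probability at least $1 - \frac{\delta}{8k}$. Therefore,

$$\Pr\big[ |\tilde{C}_\ell(L) - C_\ell(L)| \le \varepsilon' C_\ell(L) \big] \ge 1 - \frac{\delta}{4k}.$$

If $\mathcal{E}^1_\ell$ is true, then:

$$C_\ell(L) \in C_\ell(P)\, p^\ell \pm \frac{\varepsilon_{\ell-1} F_\ell(P)\, p^\ell}{2\, \ell!}.$$

Since $\mathcal{E}^1_\ell$ holds with probability at least $1 - \frac{\delta}{4k}$, the following inequalities hold with probability at least $1 - \frac{\delta}{2k}$:

$$\big| \tilde{C}_\ell(L) - C_\ell(L) \big| \le \varepsilon' C_\ell(L) \le \varepsilon' C_\ell(P)\, p^\ell + \frac{\varepsilon_{\ell-1} \varepsilon' F_\ell(P)\, p^\ell}{2\, \ell!} \le \frac{\varepsilon' F_\ell(P)\, p^\ell}{\ell!} + \frac{\varepsilon_{\ell-1} \varepsilon' F_\ell(P)\, p^\ell}{2\, \ell!} \le \frac{F_\ell(P)\, p^\ell}{4\, \ell!} \Big( \varepsilon_{\ell-1} + \frac{\varepsilon_{\ell-1}^2}{2} \Big) \le \frac{\varepsilon_{\ell-1} F_\ell(P)\, p^\ell}{2\, \ell!}.$$

4 Distinct Elements

There are strong lower bounds for the accuracy of estimating the number of distinct values through random sampling. The following theorem is from Charikar et al. [7], which we have restated slightly to fit our notation (the original theorem is about database tables). Let $F_0$ be the number of distinct elements in a data set $T$ of total size $n$. Note that $T$ may be a stored data set, and need not be processed in a one-pass streaming manner.

Theorem 3 (Charikar et al. [7]) Consider any (randomized) estimator $\hat{F}_0$ for the number of distinct values $F_0$ of $T$, that examines at most $r$ out of the $n$ elements in $T$. For any $\gamma > e^{-r}$, there exists a choice of the input $T$ such that with probability at least $\gamma$, the multiplicative error is at least $\sqrt{\frac{n-r}{2r} \ln \gamma^{-1}}$.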

The above theorem implies that if we observe $o(n)$ elements of $P$, then it is not possible to get even an estimate with a constant multiplicative error. This lower bound for the non-streaming model leads to the following lower bound for sampled streams.

Theorem 4 ($F_0$ Lower Bound) For sampling probability $p \in (0, 1/12]$ and any algorithm that estimates $F_0$ by observing $L$, there is an input stream such that the algorithm will have a multiplicative error of $\Omega(1/\sqrt{p})$ with probability at least $(1 - e^{-np})/2$.

Proof Let $\mathcal{E}_1$ denote the event $|L| \le 6np$. Let $\beta$ denote the multiplicative error of any algorithm (perhaps non-streaming) that estimates $F_0(P)$ by observing $L$. Let $\alpha = \sqrt{\frac{\ln 2}{12p}}$. Let $\mathcal{E}_2$ denote the event $\beta \ge \alpha$.

Note that $|L|$ is a binomial random variable. The expected size of the sampled stream is $E[|L|] = np$. By using a Chernoff bound:

$$\Pr[\mathcal{E}_1] = 1 - \Pr\big[ |L| > 6 E[|L|] \big] \ge 1 - 2^{-6 E[|L|]} > 1 - e^{-np}.$$

If $\mathcal{E}_1$ is true, then the number of elements in the sampled stream is no more than $6np$. Substituting $r = 6np$ and $\gamma = 1/2$ in Theorem 3, we get:

$$\Pr[\mathcal{E}_2 \mid \mathcal{E}_1] \ge \Pr\Big[ \beta > \sqrt{\frac{(n - 6np) \ln 2}{12np}} \ \Big|\ \mathcal{E}_1 \Big] \ge \frac{1}{2}.$$

Simplifying, and using $p \le 1/12$, we get:

$$\Pr[\mathcal{E}_2] \ge \Pr[\mathcal{E}_1 \cap \mathcal{E}_2] = \Pr[\mathcal{E}_1] \Pr[\mathcal{E}_2 \mid \mathcal{E}_1] \ge \frac{1}{2} \big( 1 - e^{-np} \big).$$

We now describe a simple streaming algorithm for estimating $F_0(P)$ by observing $L(P, p)$, which has an error of $O(1/\sqrt{p})$ with high probability.

Algorithm 2: $F_0(P)$
    Let $X$ denote a $(1/2, \delta)$-estimate of $F_0(L)$, derived using any streaming algorithm for $F_0$ (such as [9]).
    Return $X/\sqrt{p}$.

Lemma 8 ($F_0$ Upper Bound) Algorithm 2 returns an estimate $Y$ for $F_0(P)$ such that the multiplicative error of $Y$ is no more than $4/\sqrt{p}$ with probability at least $1 - (\delta + e^{-p F_0(P)/8})$.

Proof Let $D = F_0(P)$, and $D_L = F_0(L)$. Let $\mathcal{E}_1$ denote the event $D_L \ge pD/2$, $\mathcal{E}_2$ denote the event $X \ge D_L/2$, and $\mathcal{E}_3$ denote the event $X \le 3D_L/2$. Let $\mathcal{E} = \bigcap_{i=1}^{3} \mathcal{E}_i$. Without loss of generality, let $1, 2, \ldots, D$ denote the distinct items that occurred in stream $P$. Define $X_i = 1$ if at least one copy of item $i$ appeared in $L$, and 0 otherwise. The different $X_i$'s are all independent. Thus $D_L = \sum_{i=1}^{D} X_i$ is the sum of independent Bernoulli random variables and

$$E[D_L] = \sum_{i=1}^{D} \Pr[X_i = 1].$$

Since each copy of item $i$ is included in $L$ with probability $p$, we have $\Pr[X_i = 1] \ge p$. Thus, $E[D_L] \ge pD$. Applying a Chernoff bound,

$$\Pr[\overline{E_1}] = \Pr\left[D_L < \frac{pD}{2}\right] \le \Pr\left[D_L < \frac{E[D_L]}{2}\right] \le e^{-E[D_L]/8} \le e^{-pD/8}. \qquad (6)$$

Suppose $E$ is true. Then we have the following:

$$\frac{pD}{4} \le \frac{D_L}{2} \le X \le \frac{3D_L}{2} \le \frac{3D}{2}.$$

The last inequality is because $D_L$ is at most $D$. Therefore $X/\sqrt{p}$ has a multiplicative error of no more than $4/\sqrt{p}$. We now bound the probability that $E$ is false:

$$\Pr[\overline{E}] \le \sum_{i=1}^{3} \Pr[\overline{E_i}] \le \delta + e^{-pD/8},$$

where we have used the union bound, Eq. 6, and the fact that $X$ is a $(1/2, \delta)$-estimator of $D_L$.

5 Entropy

In this section we consider approximating the entropy of a stream.

Definition 3 The entropy of a frequency vector $f = (f_1, f_2, \ldots, f_m)$ is defined as $H(f) = \sum_{i=1}^{m} \frac{f_i}{n} \lg \frac{n}{f_i}$ where $n = \sum_{i=1}^{m} f_i$.

Unfortunately, in contrast to $F_0$ and $F_k$, it is not possible to multiplicatively approximate $H(f)$ even if $p$ is constant.

Lemma 9 No multiplicative error approximation is possible with probability 9/10 even with $p > 1/2$. Furthermore,
1. There exists $f$ such that $H(f) = \Theta(\log n/(pn))$ but $H(g) = 0$ with probability at least 9/10.
2. There exists $f$ such that $|H(f) - H(g)| \ge |\lg 2p|$ with probability at least 9/10.

Proof First consider the following two scenarios for the contents of the stream. In Scenario 1, $f_1 = n$, and in Scenario 2, $f_1 = n - k$ and $f_2 = f_3 = \ldots = f_{k+1} = 1$. In the first case the entropy $H(f) = 0$ whereas in the second,

$$H(f) = \frac{n-k}{n} \lg \frac{n}{n-k} + \frac{k}{n} \lg n = \frac{n-k}{n}\,\Theta\!\left(\frac{k}{n-k}\right) + \frac{k}{n} \lg n = \Theta\!\left(\frac{(1 + \lg n)\,k}{n}\right).$$

Distinguishing these streams requires that at least one value other than 1 is present in the subsampled stream. No such value is present with probability $(1-p)^k \ge 1 - pk$, and hence with $k = p^{-1}/10$ the two scenarios are indistinguishable with probability at least 9/10.

For the second part of the lemma consider the stream with $f_1 = f_2 = \ldots = f_m = 1$ and hence $H(f) = \lg m$. But $H(g) = \lg |L|$ where $|L|$ is the number of elements in the sampled stream. By an application of the Chernoff bound, $|L|$ is at most $2pm$ with probability at least 9/10 and the result follows.

Instead we will show that it is possible to approximate $H(f)$ up to a constant factor with an additional additive error term that tends to zero if $p = \omega(n^{-1/3})$. It will also be convenient to consider the following quantity:

$$H_{pn}(g) = \sum_{i=1}^{m} \frac{g_i}{pn} \lg \frac{pn}{g_i}.$$

The following proposition establishes that $H_{pn}(g)$ is a very good approximation to $H(g)$.

Proposition 1 With probability 199/200, $|H_{pn}(g) - H(g)| = O(\log m/\sqrt{pn})$.

Proof By an application of the Chernoff bound, with probability 199/200,

$$\left|pn - \sum_{i=1}^{m} g_i\right| \le c\sqrt{pn}$$

for some constant $c > 0$. Hence, if $n' = \sum_{i=1}^{m} g_i$ and $\gamma = n'/(pn)$, it follows that $\gamma = 1 \pm O(1/\sqrt{pn})$. Then

$$H_{pn}(g) = \sum_{i=1}^{m} \frac{g_i}{pn} \lg \frac{pn}{g_i} = \sum_{i=1}^{m} \frac{\gamma g_i}{n'} \lg \frac{n'}{\gamma g_i} = H(g) + O(1/\sqrt{pn}) + O(H(g)/\sqrt{pn}).$$

The next lemma establishes that the entropy of $g$ is within a constant factor of the entropy of $f$ plus a small additive term.

Lemma 10 With probability 99/100, if $p = \omega(n^{-1/3})$,
1. $H_{pn}(g) \le O(H(f))$.
2. $H_{pn}(g) \ge H(f)/2 - O\!\left(\frac{1}{p^{1/2} n^{1/6}}\right)$.

Proof For the first part of the lemma, first note that

$$E[H_{pn}(g)] = \sum_{i=1}^{m} E\left[\frac{g_i}{pn} \lg \frac{pn}{g_i}\right] \le \sum_{i=1}^{m} \frac{E[g_i]}{pn} \lg \frac{pn}{E[g_i]} = \sum_{i=1}^{m} \frac{p f_i}{pn} \lg \frac{pn}{p f_i} = H(f),$$

where the inequality follows from Jensen's inequality since the function $x \lg x^{-1}$ is concave. Hence, by Markov's inequality,

$$\Pr[H_{pn}(g) \le 100\,H(f)] \ge 99/100.$$

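To prove the second part of the lemma, define $f^* = c\,p^{-1} \varepsilon^{-2} \log n$ for some sufficiently large constant $c$ and $\varepsilon \in (0, 1)$. We then partition $[m]$ into $A = \{i : f_i < f^*\}$ and $B = \{i : f_i \ge f^*\}$ and consider $H(f) = H^A(f) + H^B(f)$ where

$$H^A(f) = \sum_{i \in A} \frac{f_i}{n} \lg \frac{n}{f_i} \quad \text{and} \quad H^B(f) = \sum_{i \in B} \frac{f_i}{n} \lg \frac{n}{f_i}.$$

By applications of the Chernoff and union bounds, with probability at least 299/300,

$$|g_i - p f_i| \le \begin{cases} \varepsilon p f^* & \text{if } i \in A \\ \varepsilon p f_i & \text{if } i \in B. \end{cases}$$

Hence,

$$H^B_{pn}(g) = \sum_{i \in B} \frac{g_i}{pn} \lg \frac{pn}{g_i} = \sum_{i \in B} \frac{(1 \pm \varepsilon) f_i}{n} \lg \frac{n}{(1 \pm \varepsilon) f_i} = (1 \pm \varepsilon) H^B(f) + O(\varepsilon).$$

For $H^A_{pn}(g)$ we have two cases depending on whether $\sum_{i \in A} f_i$ is smaller or larger than $\theta := c\,p^{-1} \varepsilon^{-2}$. If $\sum_{i \in A} f_i \le \theta$ then

$$H^A(f) = \sum_{i \in A} \frac{f_i}{n} \lg \frac{n}{f_i} \le \frac{\theta}{n} \lg n.$$

On the other hand, if $\sum_{i \in A} f_i \ge \theta$ then by an application of the Chernoff bound,

$$\left|\sum_{i \in A} g_i - p \sum_{i \in A} f_i\right| \le \varepsilon p \sum_{i \in A} f_i$$

and hence

$$H^A_{pn}(g) = \sum_{i \in A} \frac{g_i}{pn} \lg \frac{pn}{g_i} \ge \sum_{i \in A} \frac{g_i}{pn} \lg \frac{n}{(1+\varepsilon) f^*} \ge (1 - \varepsilon) \lg \frac{n}{(1+\varepsilon) f^*} \sum_{i \in A} \frac{f_i}{n} \ge \left(1 - \varepsilon - \frac{\lg((1+\varepsilon) f^*)}{\lg n}\right) H^A(f).$$

Combining the above cases we deduce that

$$H_{pn}(g) \ge \left(1 - \varepsilon - \frac{\lg(p^{-1} \varepsilon^{-2} \log n)}{\lg n}\right) H(f) - O(\varepsilon) - O\!\left(\frac{\varepsilon^{-2} \lg n}{pn}\right).$$

Setting $\varepsilon = p^{-1/2} n^{-1/6}$ we get

$$H_{pn}(g) \ge \left(1 - p^{-1/2} n^{-1/6} - \frac{\lg(n^{1/3} \log n)}{\lg n}\right) H(f) - O(p^{-1/2} n^{-1/6}) - O\!\left(\frac{\lg n}{n^{2/3}}\right) \ge H(f)/2 - O(p^{-1/2} n^{-1/6}).$$

Therefore, by using an existing entropy estimation algorithm (e.g., [25]) to multiplicatively estimate $H(g)$, we have a constant factor approximation to $H(f)$ if $H(f) = \omega(p^{-1/2} n^{-1/6})$. The next theorem follows directly from Proposition 1 and Lemma 10.

Theorem 5 It is possible to approximate $H(f)$ up to a constant factor in $O(\mathrm{polylog}(m, n))$ space if $H(f) = \omega(p^{-1/2} n^{-1/6})$.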
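The quantities compared in this section can be explored empirically. The sketch below is an illustration only, not the streaming algorithm of [25]: it computes $H(f)$, $H(g)$ and $H_{pn}(g)$ exactly from explicit frequency vectors, with $g$ obtained by Bernoulli sampling. The function names and the example frequency vector are assumptions made for this illustration.

```python
import math
import random


def entropy(freqs):
    """H(f) = sum_i (f_i / n) * lg(n / f_i), with n = sum_i f_i."""
    n = sum(freqs)
    return sum((f / n) * math.log2(n / f) for f in freqs if f > 0)


def scaled_entropy(sampled_freqs, pn):
    """H_{pn}(g) = sum_i (g_i / pn) * lg(pn / g_i); zero counts contribute 0."""
    return sum((g / pn) * math.log2(pn / g) for g in sampled_freqs if g > 0)


def sample_frequencies(freqs, p, rng):
    """Frequencies g_i after keeping each stream element with probability p."""
    return [sum(rng.random() < p for _ in range(f)) for f in freqs]


if __name__ == "__main__":
    rng = random.Random(0)
    f = [2000] * 50 + [10] * 1000   # hypothetical frequency vector
    p = 0.2
    n = sum(f)
    g = sample_frequencies(f, p, rng)
    print("H(f)      =", entropy(f))
    print("H(g)      =", entropy(g))
    print("H_{pn}(g) =", scaled_entropy(g, p * n))
```

On inputs like this one, where $H(f)$ is far from zero, the three values are typically close; the scenarios in the proof of Lemma 9 are the cases where they diverge.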

6 Heavy Hitters

There are two common notions for finding heavy hitters in a stream: the $F_1$-heavy hitters, and the $F_2$-heavy hitters.

Definition 4 In the $F_k$-heavy hitters problem, $k \in \{1, 2\}$, we are given a stream of updates to an underlying frequency vector $f$ and parameters $\alpha$, $\varepsilon$, and $\delta$. The algorithm is required to output a set $S$ of $O(1/\alpha)$ items such that: (1) every item $i$ for which $f_i \ge \alpha (F_k)^{1/k}$ is included in $S$, and (2) any item $i$ for which $f_i < (1-\varepsilon)\alpha (F_k)^{1/k}$ is not included in $S$. The algorithm is additionally required to output approximations $f'_i$ with

$$\forall i \in S, \quad f'_i \in [(1-\varepsilon) f_i, (1+\varepsilon) f_i].$$

The overall success probability should be at least $1 - \delta$.

The intuition behind the algorithm for heavy hitters is as follows. Suppose an item $i$ was an $F_k$ heavy hitter in the original stream $P$, i.e. $f_i \ge \alpha (F_k)^{1/k}$. Then, by a Chernoff bound, it can be argued that with high probability, $g_i$, the frequency of $i$ in the sampled stream, is close to $p f_i$. In such a case, it can be shown that $i$ is also a heavy hitter in the sampled stream and will be detected by an algorithm that identifies heavy hitters on the sampled stream with the right choice of parameters. Similarly, it can be argued that an item $i$ such that $f_i < (1-\varepsilon)\alpha (F_k)^{1/k}$ cannot reach the required frequency threshold on the sampled stream, and will not be returned by the algorithm.

We present the analysis below assuming that the heavy hitter algorithm on the sampled stream is the CountMin sketch. Other algorithms for heavy hitters can be used too, such as the Misra-Gries algorithm [33]; note that the Misra-Gries algorithm works on insert-only streams, while the CountMin sketch works on general update streams, with additions as well as deletions.

Theorem 6 Suppose that $F_1(P) \ge C p^{-1} \alpha^{-1} \varepsilon^{-2} \log(n/\delta)$ for a sufficiently large constant $C > 0$. There is a one pass streaming algorithm which observes the sampled stream $L$ and computes the $F_1$ heavy hitters of the original stream $P$ with probability at least $1 - \delta$. This algorithm uses $O(\varepsilon^{-1} \log^2(n/(\alpha\delta)))$ bits of space.
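The intuition above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm analyzed in the paper: it swaps in a Misra-Gries summary (the insert-only alternative mentioned above) for the CountMin sketch, the function names and the counter budget `k` are hypothetical, and the relaxed threshold $\alpha' = (1 - 2\varepsilon/5)\alpha$ is taken as an assumption. Reported counts are scaled by $1/p$ to estimate the original frequencies.

```python
import random


def bernoulli_sample(stream, p, rng):
    """Keep each element independently with probability p (the sampled stream L)."""
    return [x for x in stream if rng.random() < p]


def misra_gries(stream, k):
    """Misra-Gries summary with at most k counters.

    Returns (counters, length). Each counter undercounts the true
    frequency in `stream` by at most length / (k + 1).
    """
    counters = {}
    length = 0
    for x in stream:
        length += 1
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, length


def f1_heavy_hitters_from_sample(stream, p, alpha, eps, k, rng):
    """Report candidate F_1-heavy hitters of `stream` using only a p-sample.

    Returns {item: estimated original frequency}. Hypothetical helper:
    alpha_prime is the assumed relaxed threshold (1 - 2*eps/5) * alpha.
    """
    sampled = bernoulli_sample(stream, p, rng)
    counters, length = misra_gries(sampled, k)
    alpha_prime = (1 - 2 * eps / 5) * alpha
    slack = length / (k + 1)  # maximum undercount of any counter
    return {
        x: c / p  # scale the sampled count back up by 1/p
        for x, c in counters.items()
        if c + slack >= alpha_prime * length
    }
```

For example, `f1_heavy_hitters_from_sample(stream, 0.1, alpha=0.05, eps=0.2, k=200, rng=random.Random(0))` processes roughly a tenth of `stream`. The guarantee of Theorem 6 only applies when $F_1(P)$ is large enough relative to $p$, $\alpha$ and $\varepsilon$; on short streams the sample is too noisy for the thresholds to be meaningful.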
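Proof The algorithm runs the CountMin$(\alpha', \varepsilon', \delta')$ algorithm of [15] for the $F_1$-heavy hitters problem on the sampled stream, for $\alpha' = (1 - 2\varepsilon/5)\cdot\alpha$, $\varepsilon' = \varepsilon/2$, and $\delta' = \delta/4$. We return the set $S$ of items $i$ found by CountMin, and we scale each of the $f'_i$ by $1/p$. Recall that $g_i$ is the frequency of item $i$ in the sampled stream $L$. Then for sufficiently large $C > 0$ given in the theorem statement, for any $i$, by a Chernoff bound,

$$\Pr\left[g_i > \max\left\{p\left(1 + \frac{\varepsilon}{5}\right) f_i,\ \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}\right\}\right] \le \frac{\delta}{4n}.$$

By a union bound, with probability at least $1 - \delta/4$, for all $i \in [n]$,

$$g_i \le \max\left\{p\left(1 + \frac{\varepsilon}{5}\right) f_i,\ \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}\right\}. \qquad (7)$$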

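We also need the property that if $f_i \ge (1-\varepsilon)\alpha F_1(P)$, then $g_i \ge p(1 - \varepsilon/5) f_i$. For such $i$, by the premise of the theorem we have

$$E[g_i] \ge p(1-\varepsilon)\alpha F_1(P) \ge C(1-\varepsilon)\varepsilon^{-2} \log(n/\delta).$$

Hence, for sufficiently large $C$, applying a Chernoff and a union bound is enough to conclude that with probability at least $1 - \delta/4$, for all such $i$, $g_i \ge p(1 - \varepsilon/5) f_i$.

We set the parameter $\delta'$ of CountMin to equal $\delta/4$, and so CountMin succeeds with probability at least $1 - \delta/4$.

Also, $E[F_1(L)] = pF_1(P) \ge C\alpha^{-1}\varepsilon^{-2} \log(n/\delta)$, the inequality following from the premise of the theorem. By a Chernoff bound,

$$\Pr\left[\left(1 - \frac{\varepsilon}{5}\right) pF_1(P) \le F_1(L) \le \left(1 + \frac{\varepsilon}{5}\right) pF_1(P)\right] \ge 1 - \frac{\delta}{4}.$$

By a union bound, all events discussed thus far jointly occur with probability at least $1 - \delta$, and we condition on their joint occurrence in the remainder of the proof.

Lemma 11 If $f_i \ge \alpha F_1(P)$, then $g_i \ge (1 - 2\varepsilon/5)\alpha F_1(L)$. If $f_i < (1-\varepsilon)\alpha F_1(P)$, then $g_i \le (1 - \varepsilon/2)\alpha F_1(L)$.

Proof If $f_i \ge \alpha F_1(P)$, then $g_i \ge p(1 - \varepsilon/5) f_i$ and also $F_1(L) \le p(1 + \varepsilon/5)F_1(P)$. Hence,

$$g_i \ge \frac{1 - \varepsilon/5}{1 + \varepsilon/5}\,\alpha F_1(L) \ge \left(1 - \frac{2\varepsilon}{5}\right)\alpha F_1(L).$$

Next consider any $i$ for which $f_i < (1-\varepsilon)\alpha F_1(P)$. Then

$$\begin{aligned}
g_i &\le \max\left\{p\left(1 + \frac{\varepsilon}{5}\right)(1-\varepsilon)\alpha F_1(P),\ \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}\right\} \\
&\le \max\left\{\left(1 - \frac{3\varepsilon}{5}\right)\alpha F_1(L),\ \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}\right\} \\
&\le \max\left\{\left(1 - \frac{\varepsilon}{2}\right)\alpha F_1(L),\ \frac{\alpha}{2} E[F_1(L)]\right\} \\
&\le \max\left\{\left(1 - \frac{\varepsilon}{2}\right)\alpha F_1(L),\ \frac{\alpha}{2(1 - \varepsilon/5)} F_1(L)\right\} \\
&\le \left(1 - \frac{\varepsilon}{2}\right)\alpha F_1(L).
\end{aligned}$$

It follows that by setting $\alpha' = (1 - 2\varepsilon/5)\cdot\alpha$ and $\varepsilon' = \varepsilon/2$, CountMin$(\alpha', \varepsilon', \delta')$ does not return any $i$ for which $f_i < (1-\varepsilon)\alpha F_1(P)$, since for such $i$ we have $g_i \le (1 - \varepsilon/2)\alpha F_1(L)$, and so $g_i < (1 - \varepsilon/10)\alpha' F_1(L)$. On the other hand, for every $i$ for which $f_i \ge \alpha F_1(P)$, we have $i \in S$, since for such $i$ we have $g_i \ge \alpha' F_1(L)$.

It remains to show that for every $i \in S$, we have $f'_i \in [(1-\varepsilon) f_i, (1+\varepsilon) f_i]$. By the previous paragraph, for such $i$ we have $f_i \ge (1-\varepsilon)\alpha F_1(P)$. By the above conditioning, this means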

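that $g_i \ge p(1 - \varepsilon/5) f_i$. We will also have $g_i \le p(1 + \varepsilon/5) f_i$ if

$$p\left(1 + \frac{\varepsilon}{5}\right) f_i \ge \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}.$$

Since $f_i \ge (1-\varepsilon)\alpha F_1(P)$, this in turn holds if

$$F_1(P) \ge \frac{1}{2(1-\varepsilon)(1 + \varepsilon/5)}\, C p^{-1} \alpha^{-1} \varepsilon^{-2} \log\frac{n}{\delta},$$

which holds by the theorem premise provided $\varepsilon$ is less than a sufficiently small constant. This completes the proof.

Theorem 7 Suppose that $F_2^{1/2}(P) \ge C p^{-3/2} \alpha^{-1} \varepsilon^{-2} \log(n/\delta)$ and $p = \Omega(m^{-1/2})$. There is a one pass streaming algorithm which observes the sampled stream $L$ and computes the $(\alpha, 1 - p^{1/2}(1-\varepsilon))$ $F_2$-heavy hitters of the original stream with high probability.

Proof The algorithm runs the CountSketch$(\alpha', \varepsilon', \delta')$ algorithm [8] for finding the $F_2$-heavy hitters on the sampled stream, for appropriate $\alpha'$, $\varepsilon'$, and $\delta'$ specified below. We return the set $S$ of items $i$ found by CountSketch. As before we can show that if $f_i \ge (1-\varepsilon)\alpha F_2^{1/2}$, then with probability at least $1 - \delta/4$, $g_i \ge p(1 - \varepsilon/5) f_i$.

Next we bound the variance of $F_2(L)$. Since each $g_i$ is drawn from a binomial distribution $\mathrm{Bin}(f_i, p)$ on $f_i$ items with probability $p$,

$$E[F_2(L)] = \sum_i E[g_i^2] = \sum_i \left(p^2 f_i^2 + p(1-p) f_i\right) = p^2 F_2(P) + p(1-p)F_1(P).$$

Moreover,

$$\mathrm{Var}[F_2(L)] = \sum_i \mathrm{Var}[g_i^2] = \sum_i \left(E[g_i^4] - \left(p^2 f_i^2 + p(1-p) f_i\right)^2\right) \le \sum_i \left(E[g_i^4] - p^4 f_i^4\right).$$

It is known that the 4-th moment of $\mathrm{Bin}(f_i, p)$ is

$$f_i p\left(1 - 7p + 7 f_i p + 12p^2 - 18 f_i p^2 + 6 f_i^2 p^2 - 6p^3 + 11 f_i p^3 - 6 f_i^2 p^3 + f_i^3 p^3\right),$$

and subtracting $p^4 f_i^4$ from this, we obtain

$$f_i p - 7 f_i p^2 + 7 f_i^2 p^2 + 12 f_i p^3 - 18 f_i^2 p^3 + 6 f_i^3 p^3 - 6 f_i p^4 + 11 f_i^2 p^4 - 6 f_i^3 p^4 + f_i^4 p^4 - f_i^4 p^4,$$

which is $O(f_i p + f_i^2 p^2 + f_i^3 p^3)$. Hence,

$$\mathrm{Var}[F_2(L)] = O\left(pF_1 + p^2 F_2(P) + p^3 F_3(P)\right).$$

By Chebyshev's inequality, using $F_1 \le F_2$ and $F_3 \le F_2^{3/2}$,

$$\begin{aligned}
\Pr\left[\left|F_2(L) - E[F_2(L)]\right| \ge \frac{\varepsilon}{5}\, p^2 F_2\right] &= O\left(\frac{pF_1 + p^2 F_2 + p^3 F_3}{\varepsilon^2 p^4 F_2^2}\right) \\
&= O\left(\frac{F_1}{\varepsilon^2 p^3 F_2^2} + \frac{1}{\varepsilon^2 p^2 F_2} + \frac{F_3}{\varepsilon^2 p F_2^2}\right) \\
&= O\left(\frac{1}{\varepsilon^2 p^3 F_2} + \frac{1}{\varepsilon^2 p^2 F_2} + \frac{1}{\varepsilon^2 p F_2^{1/2}}\right).
\end{aligned}$$

Thus with probability at least $1 - \delta/4$,

$$\left(1 - \frac{\varepsilon}{5}\right) p F_2^{1/2}(P) \le F_2(L)^{1/2} \le \left(1 + \frac{\varepsilon}{5}\right) p^{1/2} F_2^{1/2}(P).$$

By a union bound, all events discussed so far jointly occur with probability at least $1 - \delta$, and we condition on them occurring in the remainder of the analysis.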
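The closed form for the fourth moment of $\mathrm{Bin}(f_i, p)$ used in the variance bound can be checked with exact rational arithmetic. The sketch below compares it against a direct sum over the binomial probability mass function; the function names are hypothetical, and `n` plays the role of $f_i$.

```python
from fractions import Fraction
from math import comb


def binomial_raw_moment(n, p, order):
    """E[X^order] for X ~ Bin(n, p), summed exactly over the pmf."""
    return sum(
        Fraction(j) ** order * comb(n, j) * p ** j * (1 - p) ** (n - j)
        for j in range(n + 1)
    )


def fourth_moment_closed_form(n, p):
    """Closed form n*p*(1 - 7p + 7np + 12p^2 - ... + n^3 p^3) quoted in the proof."""
    return n * p * (
        1 - 7 * p + 7 * n * p + 12 * p**2 - 18 * n * p**2 + 6 * n**2 * p**2
        - 6 * p**3 + 11 * n * p**3 - 6 * n**2 * p**3 + n**3 * p**3
    )


if __name__ == "__main__":
    p = Fraction(1, 3)
    for n in range(1, 8):
        direct = binomial_raw_moment(n, p, 4)
        closed = fourth_moment_closed_form(n, p)
        excess = direct - n**4 * p**4  # the quantity bounded by O(np + n^2p^2 + n^3p^3)
        print(n, direct == closed, excess)
```

Using `Fraction` for `p` keeps every term exact, so the comparison is an equality test rather than a floating-point tolerance check.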

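Suppose that $f_i \ge \alpha F_2^{1/2}$ in the original stream. Then

$$g_i \ge p\left(1 - \frac{\varepsilon}{5}\right) f_i \ge \alpha F_2^{1/2}\, p\left(1 - \frac{\varepsilon}{5}\right) \ge \alpha\, p^{1/2}\left(1 - \frac{2\varepsilon}{5}\right) F_2(L)^{1/2}.$$

Next consider any $i$ for which $f_i < (1-\varepsilon) p^{1/2} \alpha F_2^{1/2}$. Then

$$\begin{aligned}
g_i &\le \max\left\{p\left(1 + \frac{\varepsilon}{5}\right)(1-\varepsilon) p^{1/2} \alpha F_2^{1/2}(P),\ \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}\right\} \\
&\le \max\left\{\left(1 + \frac{\varepsilon}{5}\right)(1-\varepsilon) p^{3/2} \alpha F_2^{1/2}(P),\ \frac{C}{2\varepsilon^2} \log\frac{n}{\delta}\right\} \\
&\le \left(1 - \frac{4\varepsilon}{5}\right) p^{3/2} \alpha F_2^{1/2}(P) \\
&\le \left(1 - \frac{3\varepsilon}{5}\right) p^{1/2} \alpha F_2(L)^{1/2} \\
&\le \left(1 - \frac{\varepsilon}{2}\right) p^{1/2} \alpha F_2(L)^{1/2}.
\end{aligned}$$

It follows that by setting $\alpha' = (1 - 2\varepsilon/5)\cdot\alpha \cdot p^{1/2}$, $\delta' = \delta/4$, and $\varepsilon' = \varepsilon/10$, CountSketch$(\alpha', \varepsilon', \delta')$ does not return any $i$ for which $f_i < (1-\varepsilon) p^{1/2} \alpha F_2^{1/2}(P)$, since for such $i$ we have $g_i \le (1 - \varepsilon/2) p^{1/2} \alpha F_2(L)^{1/2}$. On the other hand, for every $i$ for which $f_i \ge \alpha F_2^{1/2}$, we have $i \in S$, since for such $i$ we have $g_i \ge \alpha' F_2(L)^{1/2}$.

7 Conclusion

We presented small-space stream algorithms and lower bounds for estimating functions of interest when observing a random sample of the original stream. There are numerous directions for future work, and we mention some of them. As we have seen, our results imply time/space tradeoffs for several natural streaming problems. What other data stream problems have interesting time/space tradeoffs? Also, we have so far assumed that the sampling probability $p$ is fixed, and that the algorithm has no control over it. Suppose this was not the case, and the algorithm can change the sampling probability in an adaptive manner, depending on the current state of the stream. Is it possible to get algorithms that can observe fewer elements overall and get the same accuracy as our algorithms? For which precise models and problems is adaptivity useful? It is also interesting to obtain matching space lower bounds for the case of estimating frequency moments.

References

1. Alon, N., Matias, Y., Szegedy, M.: The Space Complexity of Approximating the Frequency Moments. Journal of Computer and System Sciences 58(1), 137-147 (1999)
2. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 633-634 (2002)
3. Bar-Yossef, Z.: The complexity of massive dataset computations. Ph.D. thesis, University of California at Berkeley (2002)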

4. Bar-Yossef, Z.: Sampling lower bounds via information theory. In: Proc. 35th Annual ACM Symposium on Theory Of Computing (STOC), pp. 335-344 (2003)
5. Barakat, C., Iannaccone, G., Diot, C.: Ranking flows from sampled traffic. In: Proc. ACM Conference on Emerging Network Experiment and Technology (CoNEXT), pp. 188-199 (2005)
6. Bhattacharyya, S., Madeira, A., Muthukrishnan, S., Ye, T.: How to scalably and accurately skip past streams. In: Proc. 23rd International Conference on Data Engineering (ICDE) Workshops, pp. 654-663 (2007)
7. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.R.: Towards estimation error guarantees for distinct values. In: Proc. 19th ACM Symposium on Principles of Database Systems (PODS), pp. 268-279 (2000)
8. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theoretical Computer Science 312(1), 3-15 (2004)
9. Cisco Systems: Random Sampled NetFlow. http://www.cisco.com/en/us/docs/ios/12_0s/feature/guide/nfstatsa.html
10. Cohen, E., Cormode, G., Duffield, N.G.: Structure-aware sampling: Flexible and accurate summarization. Proceedings of the VLDB Endowment 4(11), 819-830 (2011)
11. Cohen, E., Duffield, N.G., Kaplan, H., Lund, C., Thorup, M.: Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput. 40(5), 1402-1431 (2011)
12. Cohen, E., Duffield, N.G., Kaplan, H., Lund, C., Thorup, M.: Algorithms and estimators for summarization of unaggregated data streams. Journal of Computer and System Sciences 80(7), 1214-1244 (2014)
13. Cohen, E., Grossaug, N., Kaplan, H.: Processing top-k queries from samples. Computer Networks 52(14), 2605-2622 (2008)
14. Cormode, G., Garofalakis, M.: Sketching probabilistic data streams. In: Proc. 26th ACM International Conference on Management of Data (SIGMOD), pp. 281-292 (2007)
15. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55(1), 58-75 (2005)
16. Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Optimal sampling from distributed streams. In: Proc. ACM Symposium on Principles of Database Systems (PODS), pp. 77-86 (2010)
17. Duffield, N.G., Lund, C., Thorup, M.: Properties and prediction of flow statistics from sampled packet streams. In: Proc. Internet Measurement Workshop, pp. 159-171 (2002)
18. Duffield, N.G., Lund, C., Thorup, M.: Estimating flow distributions from sampled flow statistics. IEEE/ACM Transactions on Networking 13(5), 933-946 (2005)
19. Duffield, N.G., Lund, C., Thorup, M.: Priority sampling for estimation of arbitrary subset sums. Journal of the ACM 54(6) (2007)
20. Efraimidis, P., Spirakis, P.G.: Weighted random sampling with a reservoir. Information Processing Letters 97(5), 181-185 (2006)
21. Estan, C., Keys, K., Moore, D., Varghese, G.: Building a better netflow. In: Proc. ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), pp. 245-256 (2004)
22. Estan, C., Varghese, G.: New directions in traffic measurement and accounting. In: Proc. ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), pp. 323-336 (2002)
23. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proc. ACM SIGMOD International Conference on Management of Data, pp. 331-342 (1998)
24. Guha, S., Huang, Z.: Revisiting the direct sum theorem and space lower bounds in random order streams. In: Automata, Languages and Programming, 36th International Colloquium, ICALP (1), pp. 513-524 (2009)
25. Harvey, N.J.A., Nelson, J., Onak, K.: Sketching and streaming entropy via approximation theory. In: Proc. 49th IEEE Conference on Foundations Of Computer Science (FOCS), pp. 489-498 (2008)
26. Hohn, N., Veitch, D.: Inverting sampled traffic. IEEE/ACM Transactions on Networking 14(1), 68-80 (2006)
27. Indyk, P., Woodruff, D.P.: Optimal approximations of the frequency moments of data streams. In: Proc. 37th Annual ACM Symposium on Theory of Computing (STOC), pp. 202-208 (2005)
28. Jayram, T.S., McGregor, A., Muthukrishnan, S., Vee, E.: Estimating statistical aggregates on probabilistic data streams. ACM Transactions on Database Systems 33, 26:1-26:30 (2008)
29. Kane, D.M., Nelson, J., Woodruff, D.P.: On the exact space complexity of sketching and streaming small norms. In: Proc. 21st ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1161-1178 (2010)
30. Lahiri, B., Tirthapura, S.: Stream sampling. In: L. Liu, M.T. Özsu (eds.) Encyclopedia of Database Systems, pp. 2838-2842. Springer US (2009)

31. McGregor, A. (ed.): Open Problems in Data Streams and Related Topics (2007). http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf
32. McGregor, A., Pavan, A., Tirthapura, S., Woodruff, D.: Space-efficient estimation of statistics over sub-sampled streams. In: Proc. 31st ACM Symposium on Principles of Database Systems (PODS), pp. 273-282 (2012)
33. Misra, J., Gries, D.: Finding repeated elements. Science of Computer Programming 2(2), 143-152 (1982)
34. Rusu, F., Dobra, A.: Sketching sampled data streams. In: Proc. 25th IEEE International Conference on Data Engineering (ICDE), pp. 381-392 (2009)
35. Szegedy, M.: The DLT priority sampling is essentially optimal. In: Proc. Annual ACM Symposium on Theory of Computing (STOC), pp. 150-158 (2006)
36. Tirthapura, S., Woodruff, D.P.: Optimal random sampling from distributed streams revisited. In: Proc. International Symposium on Distributed Computing (DISC), pp. 283-297 (2011)
37. Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11(1), 37-57 (1985)