# Detecting Change in Data Streams


## Transcription

the distribution of beam strengths over time. Changes in this distribution can signal the development of a problem or give evidence that a new manufacturing technique is creating an improvement. Information that describes the change could also help in analysis of the technique.

**Data Mining.** Suppose a data stream mining algorithm is creating a data mining model that describes certain aspects of the data stream. The data mining model does not need to change as long as the underlying data distribution is stable. However, for sufficiently large changes in the distribution that generates the data stream, the model will become inaccurate. In this case it is better to completely remove the contributions of old data (which arrived before the change) from the model rather than to wait for enough new data to come in and outweigh the stale data. Further information about where the change has occurred could help the user avoid rebuilding the entire model: if the change is localized, it may only be necessary to rebuild part of the model.

#### 1.2 Statistical Requirements

Recall that our basic approach to change detection in data streams uses two sliding windows over the data stream. This reduces the problem of detecting change over a data stream to the problem of testing whether the two samples in the windows were generated by different distributions. Consequently, we start by considering the case of detecting a difference in distribution between two input samples. Assume that two datasets $S_1$ and $S_2$ were generated by two probability distributions, $P_1$ and $P_2$. A natural question to ask is: can we infer from $S_1$ and $S_2$ whether they were generated by the same distribution, $P_1 = P_2$, or is it the case that $P_1 \neq P_2$? To answer this question, we need to design a test that can tell us whether $P_1$ and $P_2$ are different. Assuming that we take this approach, what are our requirements on such a test?
Our first requirement is that the test should have two-way guarantees. We want the guarantee that our test detects true changes with high probability, i.e., it should have few false negatives. But we also want the test to notify us only if a true change has occurred, i.e., it should have few false positives. Furthermore, we need to be able to extend those guarantees from the two-sample problem to the data stream.

There are also practical considerations that affect our choice of test. If we know that the distribution of the data has a certain parametric form (for example, it is a normal distribution), then we can draw upon decades of research in the statistics community, where powerful tests have been developed. Unfortunately, real data is rarely that well-behaved in practice; it does not follow nice parametric distributions. Thus we need a non-parametric test that makes no assumptions on the form of the distribution.

Our last requirement is motivated by the users of our test. A user not only wants to be notified that the underlying distribution has changed, but she also wants to know how it has changed. Thus we want a test that not only detects changes, but also describes the change in a user-understandable way.

In summary, we want a change detection test with these four properties: the rate of spurious detections (false positives) and the rate of missed detections (false negatives) can be controlled, it is non-parametric, and it provides a description of the detected change.

#### 1.3 An Informal Motivation of our Approach

Let us start by discussing briefly how our work relates to the most common non-parametric statistical test for change detection, the Wilcoxon test [24]. (Recall that parametric tests are unsuitable as we do not want to restrict the applicability of our work to data that follows certain parametric distributions.)
The Wilcoxon test is a statistical test that measures the tendency of one sample to contain values that are larger than those in the other sample. We can control the probability that the Wilcoxon test raises a false alarm. However, this test only permits the detection of certain limited kinds of changes, such as a change in the mean of a normal distribution. Furthermore, it does not give a meaningful description of the changes it detects.

We address the issue of two-way guarantees by using a distance function (between distributions) to help describe the change. Given a distance function, our test provides guarantees of the form: to detect a distance $> \epsilon$ between two distributions $P_1$ and $P_2$, we will need samples of at most $n$ points from each of $P_1$ and $P_2$.

Let us start by considering possible distance functions. One possibility is to use powerful information theoretic distances, such as the Jensen-Shannon Divergence (JSD) [19]. However, to use these measures we need discrete distributions, and it may also be hard to explain the idea of entropy to the average end-user. There exist many other common measures of distance between distributions, but they are either too insensitive or too sensitive. For example, the commonly used $L_1$ distance is too sensitive and can require arbitrarily large samples to determine if two distributions have $L_1$ distance $> \epsilon$ [4]. At the other extreme, $L_p$ norms (for $p > 1$) are far too insensitive: two distributions $D_1$ and $D_2$ can be close in the $L_p$ norm and yet have this undesirable property: all events with nonzero probability under $D_1$ have probability 0 under $D_2$. Based on these shortcomings of existing work, we introduce in Section 3 a new distance metric that is specifically tailored to find distribution changes while providing strong statistical guarantees with small sample sizes.

In addition, users are usually not interested in arbitrary change, but rather in change that has a succinct representation that they can understand. Thus we restrict our notion of change to showing the hyperrectangle (or, in the special case of one attribute, the interval) in the attribute space that is most greatly affected by the change in distribution. We formally introduce this notion of change in Section 3.

#### 1.4 Our Contributions

In this paper we give the first formal treatment of change detection in data streams. Our techniques assume that data points are generated independently but otherwise make no assumptions about the generating distribution (i.e., the techniques are non-parametric). They give provable guarantees that the change that is detected is not noise but statistically significant, and they allow us to describe the change to a user in a succinct way. To the best of our knowledge, there is no previous work that addresses all of these requirements. In particular, we address three aspects of the problem: (1) we introduce a novel family of distance measures between distributions; (2) we design an algorithmic setup for change detection in data streams; and (3) we provide both analytic and numerical performance guarantees on the accuracy of change detection.

The remainder of this paper is organized as follows. After a description of our meta-algorithm in Section 2, we introduce novel metrics over the space of distributions and show that they avoid statistical problems common to previously known distance functions (Section 3). We then show how to apply these metrics to detect changes in the data stream setting, and we give strong statistical guarantees on the types of changes that are detected (Section 4). In Section 5 we develop algorithms that efficiently find the areas where change has occurred, and we evaluate our techniques in a thorough experimental analysis in Section 6.

### 2 A Meta-Algorithm For Change Detection

In this section we describe our meta-algorithm for change detection in streaming data.
The meta-algorithm reduces the problem from the streaming data scenario to the problem of comparing two (static) sample sets. We consider a data stream $S$ to be a sequence $\langle s_1, s_2, \dots \rangle$ where each item $s_i$ is generated by some distribution $P_i$ and each $s_i$ is independent of the items that came before it. We say that a change has occurred if $P_i \neq P_{i+1}$, and we call time $i + 1$ a change point [footnote 1]. We also assume that only a bounded amount of memory is available, and that in general the size of the data stream is much larger than the amount of available memory.

Footnote 1: It is not hard to realize that no algorithm can be guaranteed to detect any such change point. We shall therefore require the detection of change only when the difference between $P_i$ and $P_{i+1}$ is above some threshold. We elaborate on this issue in Section 3.

**Algorithm 1: FIND CHANGE**

1. **for** $i = 1, \dots, k$ **do**
2. $c_0 \leftarrow 0$
3. $\text{Window}_{1,i} \leftarrow$ first $m_{1,i}$ points from time $c_0$
4. $\text{Window}_{2,i} \leftarrow$ next $m_{2,i}$ points in stream
5. **end for**
6. **while** not at end of stream **do**
7. **for** $i = 1, \dots, k$ **do**
8. Slide $\text{Window}_{2,i}$ by 1 point
9. **if** $d(\text{Window}_{1,i}, \text{Window}_{2,i}) > \alpha_i$ **then**
10. $c_0 \leftarrow$ current time
11. Report change at time $c_0$
12. Clear all windows and GOTO step 1
13. **end if**
14. **end for**
15. **end while**

Note that the meta-algorithm above is actually running $k$ independent algorithms in parallel, one for each parameter triplet $(m_{1,i}, m_{2,i}, \alpha_i)$.

The meta-algorithm requires a function $d$, which measures the discrepancy between two samples, and a set of triples $\{(m_{1,1}, m_{2,1}, \alpha_1), \dots, (m_{1,k}, m_{2,k}, \alpha_k)\}$. The numbers $m_{1,i}$ and $m_{2,i}$ specify the sizes of the $i$-th pair of windows $(X_i, Y_i)$. The window $X_i$ is a baseline window and contains the first $m_{1,i}$ points of the stream that occurred after the last detected change. Each window $Y_i$ is a sliding window that contains the latest $m_{2,i}$ items in the data stream. Immediately after a change has been detected, it contains the $m_{2,i}$ points of the stream that follow the window $X_i$.
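The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the names `find_change` and `mean_gap` are hypothetical, `mean_gap` is only a toy stand-in for the discrepancy function `d`, and the sketch assumes that the new baseline windows start with the point that follows a reported change.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Distance = Callable[[Sequence[float], Sequence[float]], float]


def find_change(
    stream: Iterable[float],
    params: Sequence[Tuple[int, int, float]],  # one (m1, m2, alpha) per window pair
    d: Distance,
) -> Iterator[int]:
    """Yield the 0-based stream position at which each change is reported."""
    baselines = [[] for _ in params]                     # fixed windows X_i
    sliders = [deque(maxlen=m2) for _, m2, _ in params]  # sliding windows Y_i
    for t, x in enumerate(stream):
        changed = False
        for (m1, m2, alpha), base, slide in zip(params, baselines, sliders):
            if len(base) < m1:
                base.append(x)   # still filling the baseline window
                continue
            slide.append(x)      # a full deque drops its oldest point
            if len(slide) == m2 and d(base, list(slide)) > alpha:
                changed = True
                break
        if changed:
            yield t
            # Clear all windows (assumption: the next point starts new baselines).
            for base, slide in zip(baselines, sliders):
                base.clear()
                slide.clear()


def mean_gap(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Toy discrepancy (hypothetical stand-in for d): gap between sample means."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


# A stream whose mean jumps from 0 to 1 at position 50.
reports = list(find_change([0.0] * 50 + [1.0] * 50, [(10, 10, 0.5)], mean_gap))
```

Each baseline window stays fixed between reports, while the matching `deque` slides one point at a time. Any two-sample discrepancy can be passed in as `d`.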
We slide the window $Y_i$ one step forward whenever a new item appears on the stream. At each such update, we check if $d(X_i, Y_i) > \alpha_i$. Whenever the distance is $> \alpha_i$, we report a change and then repeat the entire procedure with $X_i$ containing the first $m_{1,i}$ points after the change, etc. The meta-algorithm is shown in Algorithm 1.

It is crucial to keep the window $X_i$ fixed while sliding the window $Y_i$, so that we always maintain a reference to the original distribution. We use several pairs of windows because small windows can detect sudden, large changes while large windows can detect smaller changes that last over longer periods of time.

The key to our scheme is the intelligent choice of distance function $d$ and the constants $\alpha_i$. The function $d$ must truly quantify an intuitive notion of change so that the change can be explained to a non-technical user. The choice of such a $d$ is discussed in Section 3. The parameter $\alpha_i$ defines our balance between sensitivity and robustness of the detection. The smaller $\alpha_i$ is, the more likely we are to detect small changes in the distribution, but the larger is our risk of false alarm.

We wish to provide statistical guarantees about the accuracy of the change report. Providing such guarantees is highly non-trivial for two reasons: we have no prior knowledge about the distributions and the changes, and the repeated testing of $d(X_i, Y_i)$ necessarily exhibits the multiple testing problem: the more times you run a random experiment, the more likely you are to see non-representative samples. We deal with these issues in Section 3 and Section 4, respectively.

### 3 Distance Measures for Distribution Change

In this section we focus on the basic two-sample comparison. Our goal is to design algorithms that examine samples drawn from two probability distributions and decide whether these distributions are different. Furthermore, we wish to have two-sided performance guarantees for our algorithms (or tests): results showing that, if the algorithm accesses sufficiently large samples, then on one hand, if the samples come from the same distribution, the probability that the algorithm will output CHANGE is small, and on the other hand, if the samples were generated by different distributions, our algorithm will output CHANGE with high probability.

It is not hard to realize that no matter what the algorithm does, for every finite sample size there exists a pair of distinct distributions such that, with high probability, samples of that size will not suffice for the algorithm to detect that they are coming from different distributions. The best type of guarantee that one can conceivably hope to prove is therefore of the type: if the distributions generating the input samples are sufficiently different, then samples of a certain bounded size will suffice to detect that these distributions are distinct. However, to make such a statement precise, one needs a way to measure the degree of difference between two given probability distributions. Therefore, before we go on with our analysis of distribution change detection, we have to define the type of changes we wish to detect. This section addresses this issue by examining several notions of distance between probability distributions.
The most natural notion of distance (or similarity) between distributions is the total variation or the $L_1$ norm. Given two probability distributions $P_1, P_2$ over the same measure space $(X, \mathcal{E})$ (where $X$ is some domain set and $\mathcal{E}$ is a collection of subsets of $X$, the measurable subsets), the total variation distance between these distributions is defined as

$$TV(P_1, P_2) = 2 \sup_{E \in \mathcal{E}} |P_1(E) - P_2(E)|$$

(or, equivalently, when the distributions have density functions $f_1, f_2$, respectively, the $L_1$ distance between the distributions is defined by $\int |f_1(x) - f_2(x)| \, dx$). Note that, with this normalization, the total variation takes values in the interval $[0, 2]$.

However, for practical purposes the total variation is an overly sensitive notion of distance. First, $TV(P_1, P_2)$ may be quite large for distributions that should be considered similar for all practical purposes (for example, it is easy to construct two distributions that differ, say, only on real numbers whose 9th decimal digit is 5, and yet their total variation distance is 0.2). The second, related, argument against the use of the total variation distance is that it may be infeasibly difficult to detect the difference between two distributions from the samples they generate. Batu et al. [4] prove that, over discrete domains of size $n$, for every sample-based change detection algorithm, there are pairs of distributions that have total variation distance $\ge 1/3$ and yet, if the sample sizes are below $O(n^{2/3})$, it is highly unlikely that the algorithm will detect a difference between the distributions. In particular, this means that over infinite domains (like the real line) any sample-based change detection algorithm is bound to require arbitrarily large samples to detect the change even between distributions whose total variation distance is large.
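As a small sanity check of the definition, the following Python sketch compares the supremum form of the total variation with the $L_1$ form on a finite domain. It is illustrative only: the function names and the dictionary representation of a distribution are assumptions made here, and the brute-force search over events is exponential in the domain size.

```python
from itertools import combinations


def total_variation(p, q):
    """2 * sup_E |P(E) - Q(E)|, by brute force over all events E of a finite domain."""
    domain = sorted(set(p) | set(q))
    best = 0.0
    for r in range(len(domain) + 1):
        for event in combinations(domain, r):
            gap = abs(sum(p.get(x, 0.0) for x in event) - sum(q.get(x, 0.0) for x in event))
            best = max(best, gap)
    return 2 * best


def l1_distance(p, q):
    """Sum over the domain of |p(x) - q(x)|, the discrete analogue of the integral form."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.25, "c": 0.5}
tv, l1 = total_variation(p, q), l1_distance(p, q)
```

On this example both forms agree. For two distributions with disjoint supports, the factor 2 in the definition gives the maximal value 2.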
We wish to employ a notion of distance that, on one hand, captures practically significant distribution differences, and yet, on the other hand, allows the existence of finite sample based change detection algorithms with proven detection guarantees. Our solution is based upon the idea of focusing on a family of significant domain subsets.

**Definition 1.** Fix a measure space and let $\mathcal{A}$ be a collection of measurable sets. Let $P$ and $P'$ be probability distributions over this space. The $\mathcal{A}$-distance between $P$ and $P'$ is defined as

$$d_{\mathcal{A}}(P, P') = 2 \sup_{A \in \mathcal{A}} |P(A) - P'(A)|.$$

We say that $P, P'$ are $\epsilon$-close with respect to $\mathcal{A}$ if $d_{\mathcal{A}}(P, P') \le \epsilon$. For a finite domain subset $S$ and a set $A \in \mathcal{A}$, let the empirical weight of $A$ with respect to $S$ be

$$S(A) = \frac{|S \cap A|}{|S|}.$$

For finite domain subsets $S_1$ and $S_2$, we define the empirical distance to be

$$d_{\mathcal{A}}(S_1, S_2) = 2 \sup_{A \in \mathcal{A}} |S_1(A) - S_2(A)|.$$

The intuitive meaning of $\mathcal{A}$-distance is that it is the largest change in probability of a set that the user cares about. In particular, if we consider the scenario of monitoring environmental changes spread over some geographical area, one may assume that the changes that are of interest will be noticeable in some local regions and thus be noticeable by monitoring spatial rectangles or circles. Clearly, this notion of $\mathcal{A}$-distance is a relaxation of the total variation distance.

It is not hard to see that $\mathcal{A}$-distance is always $\le$ the total variation and therefore is less restrictive. This point helps get around the statistical difficulties associated with the $L_1$ norm. If $\mathcal{A}$ is not too complex [footnote 2], then there exists a test $t$ that can distinguish (with high probability) if two distributions are $\epsilon$-close (with respect to $\mathcal{A}$) using a sample size that is independent of the domain size.

For the case where the domain set is the real line, the Kolmogorov-Smirnov statistic considers $\sup_x |F_1(x) - F_2(x)|$ as the measure of difference between two distributions (where $F_i(x) = P_i(\{y : y \le x\})$). By setting $\mathcal{A}$ to be the set of all the one-sided intervals $(-\infty, x)$, the $\mathcal{A}$-distance becomes the Kolmogorov-Smirnov statistic (up to the constant factor 2 in the definition). Thus our notion of distance, $d_{\mathcal{A}}$, can be viewed as a generalization of this classical statistic. By picking $\mathcal{A}$ to be a family of intervals (or a family of convex sets for higher dimensional data), the $\mathcal{A}$-distance reflects the relevance of locally centered changes.

Having adopted the concept of determining distance by focusing on a family of relevant subsets, there are different ways of quantifying such a change. The $\mathcal{A}$-distance defined above is additive: the significance of a change is measured by the difference of the weights of a subset between the two distributions. Alternatively, one could argue that changing the probability weight of a set from 0.5 to 0.4 is less significant than the change of a set that has probability weight 0.1 under $P_1$ and weight 0 under $P_2$.

Next, we develop a variation of the notion of $\mathcal{A}$-distance, called relativized discrepancy, that takes the relative magnitude of a change into account. As we have clarified above, our aim is not only to define sensitive measures of the discrepancy between distributions, but also to provide statistical guarantees that the differences that these measures evaluate are detectable from bounded size samples.
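The empirical distance of Definition 1 is straightforward to compute for one-dimensional families. The sketch below is a hedged illustration (the function names are hypothetical, and the all-intervals version is a brute-force search, not the efficient algorithms the paper develops later). It evaluates the empirical distance for one-sided intervals, which is twice the two-sample Kolmogorov-Smirnov statistic, and for all closed intervals.

```python
from bisect import bisect_right


def empirical_weight(sample, lo, hi):
    """S(A) for the closed interval A = [lo, hi]: fraction of the sample inside A."""
    return sum(lo <= x <= hi for x in sample) / len(sample)


def d_rays(s1, s2):
    """Empirical distance for one-sided intervals (-inf, x]: twice the KS statistic."""
    a, b = sorted(s1), sorted(s2)
    gaps = (abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)) for x in a + b)
    return 2 * max(gaps)


def d_intervals(s1, s2):
    """Empirical distance for all closed intervals; endpoints at sample points suffice."""
    pts = sorted(set(s1) | set(s2))
    best = 0.0
    for i, lo in enumerate(pts):
        for hi in pts[i:]:
            gap = abs(empirical_weight(s1, lo, hi) - empirical_weight(s2, lo, hi))
            best = max(best, gap)
    return 2 * best


s1 = [0.0, 1.0, 2.0, 3.0]
s2 = [0.0, 1.4, 1.6, 3.0]
rays, intervals = d_rays(s1, s2), d_intervals(s1, s2)
```

In this toy example the second sample concentrates mass in the middle of the range. The interval family sees the full shift (the interval [1.4, 1.6] holds half of `s2` and none of `s1`), while one-sided intervals see only half of it.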
Consequently, in developing variations of the basic $d_{\mathcal{A}}$ measure, we have to take into account the statistical tool kit available for proving convergence of sample based estimates to true probabilities. In the next paragraph we outline the considerations that led us to the choice of our relativized discrepancy measures.

Let $P$ be some probability distribution, choose any $A \in \mathcal{A}$, and let $p$ be such that $P(A) = p$. Let $S$ be a sample generated by $P$ and let $n$ denote its size. Then $nS(A)$ behaves like the sum $S_n = X_1 + \cdots + X_n$ of $n = |S|$ independent 0/1 (Bernoulli) random variables with $P(X_i = 1) = p$ and $P(X_i = 0) = 1 - p$. We can use Chernoff bounds [16] to approximate the tails of the distribution of $S_n$:

$$P[S_n/n \ge (1 + \epsilon)p] \le e^{-\epsilon^2 np/3} \tag{1}$$

$$P[S_n/n \le (1 - \epsilon)p] \le e^{-\epsilon^2 np/2} \tag{2}$$

Our goal is to find an expression for $\epsilon$ as a function $\omega(p)$ so that the rate of convergence is approximately the same for all $p$. Reasoning informally, $P(p - S_n/n \ge p\,\omega(p)) \le e^{-\omega(p)^2 np/2}$, and the right-hand side is constant if $\omega(p) = 1/\sqrt{p}$. Thus $P[(p - S_n/n)/\sqrt{p} > \epsilon]$ should converge at approximately the same rate for all $p$.

Footnote 2: There is a formal notion of this complexity, the VC-dimension. We discuss it further in Section 3.

If we look at the random variables $X_1^*, \dots, X_n^*$ (where $X_i^* = 1 - X_i$), we see that $S_n^* = \sum_i X_i^* = n - S_n$ is a binomial random variable with parameter $1 - p$. Therefore the rate of convergence should be the same for $p$ and $1 - p$. To make the above probability symmetric in $p$ and $1 - p$, we can change the denominator either to $\sqrt{\min(p, 1 - p)}$ or to $\sqrt{p(1 - p)}$. The first way is more faithful to the Chernoff bound. The second approach approximates the first approach when $p$ is far from 1/2. However, the second approach gives more relative weight to the case when $p$ is close to 1/2.
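To spell out the informal scaling step, one can substitute $\epsilon/\sqrt{p}$ for $\epsilon$ in the lower-tail bound (2). The short derivation below is a worked illustration under the added assumption that $0 < \epsilon/\sqrt{p} \le 1$, so that bound (2) applies. It shows that the resulting exponent no longer depends on $p$.

```latex
% Lower-tail Chernoff bound (2) with \epsilon replaced by \epsilon/\sqrt{p},
% assuming 0 < \epsilon/\sqrt{p} \le 1:
\begin{align*}
P\!\left[\frac{p - S_n/n}{\sqrt{p}} \ge \epsilon\right]
  &= P\!\left[\frac{S_n}{n} \le \Bigl(1 - \frac{\epsilon}{\sqrt{p}}\Bigr)p\right] \\
  &\le \exp\!\left(-\frac{(\epsilon/\sqrt{p})^{2}\, n p}{2}\right)
   = e^{-\epsilon^{2} n/2}.
\end{align*}
```

The same computation applied to $1 - p$ and $n - S_n$ gives the matching rate for the complementary event, which is what motivates the symmetric denominators.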
Substituting $S(A)$ for $S_n/n$ and $P(A)$ for $p$, we get that $(P(A) - S(A))/\sqrt{\min(P(A), 1 - P(A))}$ converges at approximately the same rate for all $A$ such that $0 < P(A) < 1$, and $(P(A) - S(A))/\sqrt{P(A)(1 - P(A))}$ converges at approximately the same rate for all $A$ (when $0 < P(A) < 1$). We can modify this to the two-sample case by approximating $P(A)$ in the numerator by $S'(A)$. In the denominator, for reasons of symmetry, we approximate $P(A)$ by $(S'(A) + S(A))/2$. Taking absolute values and the sup over all $A \in \mathcal{A}$, we propose the following measures of distribution distance, and empirical statistics for estimating them.

**Definition 2 (Relativized Discrepancy).** Let $P_1, P_2$ be two probability distributions over the same measure space, let $\mathcal{A}$ denote a family of measurable subsets of that space, and $A$ a set in $\mathcal{A}$. Define $\phi_{\mathcal{A}}(P_1, P_2)$ as

$$\sup_{A \in \mathcal{A}} \frac{|P_1(A) - P_2(A)|}{\sqrt{\min\left\{\frac{P_1(A) + P_2(A)}{2},\; 1 - \frac{P_1(A) + P_2(A)}{2}\right\}}}.$$

For finite samples $S_1, S_2$, we define $\phi_{\mathcal{A}}(S_1, S_2)$ similarly, by replacing $P_i(A)$ in the above definition by the empirical measure $S_i(A) = |S_i \cap A| / |S_i|$. Define $\Xi_{\mathcal{A}}(P_1, P_2)$ as

$$\sup_{A \in \mathcal{A}} \frac{|P_1(A) - P_2(A)|}{\sqrt{\frac{P_1(A) + P_2(A)}{2}\left(1 - \frac{P_1(A) + P_2(A)}{2}\right)}}.$$

Similarly, for finite samples $S_1, S_2$, we define $\Xi_{\mathcal{A}}(S_1, S_2)$ by replacing $P_i(A)$ in the above definition by the empirical measure $S_i(A)$.
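To make Definition 2 concrete, the following Python sketch evaluates the three empirical statistics over an explicit finite list of closed intervals. It is an illustrative sketch only: the function names are hypothetical, the family is passed in by hand, and sets on which both samples have weight 0 (or both have weight 1) are skipped, which is an assumption about how to treat the 0/0 case.

```python
from math import sqrt


def weight(sample, lo, hi):
    """Empirical weight S(A) of the closed interval A = [lo, hi]."""
    return sum(lo <= x <= hi for x in sample) / len(sample)


def discrepancies(s1, s2, intervals):
    """Return (d_A, phi_A, Xi_A) over an explicit finite family of closed intervals."""
    d = phi = xi = 0.0
    for lo, hi in intervals:
        w1, w2 = weight(s1, lo, hi), weight(s2, lo, hi)
        gap, avg = abs(w1 - w2), (w1 + w2) / 2
        d = max(d, 2 * gap)
        if 0 < avg < 1:  # assumption: skip sets where the ratio would be 0/0
            phi = max(phi, gap / sqrt(min(avg, 1 - avg)))
            xi = max(xi, gap / sqrt(avg * (1 - avg)))
    return d, phi, xi


# Weight of [0, 1] drops from 0.5 to 0.4; weight of [2, 3] drops from 0.1 to 0.
s1 = [0.1, 0.2, 0.3, 0.4, 0.5, 2.5, 4.1, 4.2, 4.3, 4.4]
s2 = [0.1, 0.2, 0.3, 0.4, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6]
common = discrepancies(s1, s2, [(0, 1)])
rare = discrepancies(s1, s2, [(2, 3)])
```

The two sets mirror the 0.5-to-0.4 versus 0.1-to-0 example above: the additive statistic scores both changes the same, while both relativized statistics score the change on the low-probability set higher.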

Our experiments show that these statistics indeed tend to do better than the $d_{\mathcal{A}}$ statistic because they use the data more efficiently: a smaller change in an area of low probability is more likely to be detected by these statistics than by the $d_{\mathcal{A}}$ (or the KS) statistic.

These statistics have several nice properties. The $d_{\mathcal{A}}$ distance is obviously a metric over the space of probability distributions. So is the relativized discrepancy $\phi_{\mathcal{A}}$ (as long as for each pair of distinct distributions $P_1$ and $P_2$ there exists an $A \in \mathcal{A}$ such that $P_1(A) \neq P_2(A)$). The proof is omitted due to space limitations. We conjecture that $\Xi_{\mathcal{A}}$ is also a metric. However, the major benefit of the $d_{\mathcal{A}}$, $\phi_{\mathcal{A}}$, and $\Xi_{\mathcal{A}}$ statistics is that, in addition to detecting change, they can describe it. All sets $A$ which cause the relevant equations to be $> \epsilon$ are statistically significant. Thus the change can be described to a lay-person: the increase or decrease (from the first sample to the second sample) in the number of points that fall in $A$ is too much to be accounted for by pure chance, and therefore it is likely that the probability of $A$ has increased (or decreased).

#### 3.1 Technical preliminaries

Our basic tool for sample based estimation of the $\mathcal{A}$-distance between probability distributions is based on the Vapnik-Chervonenkis theory. Let $\mathcal{A}$ denote a family of subsets of some domain set $X$. We define a function $\Pi_{\mathcal{A}} : \mathbb{N} \to \mathbb{N}$ by

$$\Pi_{\mathcal{A}}(n) = \max\{\, |\{A \cap B : A \in \mathcal{A}\}| : B \subseteq X \text{ and } |B| = n \,\}.$$

Clearly, for all $n$, $\Pi_{\mathcal{A}}(n) \le 2^n$. For example, if $\mathcal{A}$ is the family of all intervals over the real line, then $\Pi_{\mathcal{A}}(n) = O(n^2)$.

**Definition 3 (VC-Dimension).** The Vapnik-Chervonenkis dimension of a collection $\mathcal{A}$ of sets is

$$\text{VC-dim}(\mathcal{A}) = \sup\{n : \Pi_{\mathcal{A}}(n) = 2^n\}.$$

The following combinatorial fact, known as Sauer's Lemma, is a basic useful property of the function $\Pi_{\mathcal{A}}$.

**Lemma 3.1 (Sauer, Shelah).** If $\mathcal{A}$ has a finite VC-dimension $d$, then for all $n$,

$$\Pi_{\mathcal{A}}(n) \le \sum_{i=0}^{d} \binom{n}{i}.$$

It follows that for any such $\mathcal{A}$, $\Pi_{\mathcal{A}}(n) < n^d$.
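For a concrete feel for the growth function, the sketch below counts the distinct subsets that closed intervals can pick out of $n$ points on the line. It is a brute-force illustration with a hypothetical function name, and it relies on the observation that an interval always picks out a contiguous run of the sorted points (or nothing).

```python
def growth_intervals(n):
    """Number of subsets of n distinct points on the line picked out by closed intervals."""
    points = list(range(n))   # any n distinct reals give the same count
    traces = {frozenset()}    # an interval that misses every point
    for i in range(n):
        for j in range(i, n):
            traces.add(frozenset(points[i:j + 1]))  # the contiguous run i..j
    return len(traces)


counts = [growth_intervals(n) for n in range(1, 7)]
```

The count works out to $n(n+1)/2 + 1$, which is quadratic in $n$. It equals $2^n$ for $n = 2$ but falls short for $n = 3$, since the two outer points cannot be selected without the middle one, so the VC-dimension of intervals is 2.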
In particular, for $\mathcal{A}$ being the family of intervals or rays on the real line, we get $\Pi_{\mathcal{A}}(n) < n^2$.

#### 3.2 Statistical Guarantees for our Change Detection Estimators

We consider the following scenario: $P_1, P_2$ are two probability distributions over the same domain $X$, and $\mathcal{A}$ is a family of subsets of that domain. Given two finite sets $S_1, S_2$ that are i.i.d. samples of $P_1, P_2$ respectively, we wish to estimate the $\mathcal{A}$-distance between the two distributions, $d_{\mathcal{A}}(P_1, P_2)$. Recall that, for any subset $A$ of the domain set and a finite sample $S$, we define the $S$-empirical weight of $A$ by $S(A) = |S \cap A| / |S|$. The following theorem follows by applying the classic Vapnik-Chervonenkis analysis [22] to our setting.

**Theorem 3.1.** Let $P_1, P_2$ be any probability distributions over some domain $X$, let $\mathcal{A}$ be a family of subsets of $X$, and let $\epsilon \in (0, 1)$. If $S_1, S_2$ are i.i.d. $m$-samples drawn by $P_1, P_2$ respectively, then

$$P\Bigl[\exists A \in \mathcal{A} : \bigl|\, |P_1(A) - P_2(A)| - |S_1(A) - S_2(A)| \,\bigr| \ge \epsilon\Bigr] < \Pi_{\mathcal{A}}(2m)\, 4e^{-m\epsilon^2/4}.$$

It follows that

$$P\bigl[\, |d_{\mathcal{A}}(P_1, P_2) - d_{\mathcal{A}}(S_1, S_2)| \ge \epsilon \,\bigr] < \Pi_{\mathcal{A}}(2m)\, 4e^{-m\epsilon^2/4},$$

where $P$ in the above inequalities is the probability over the pairs of samples $(S_1, S_2)$ induced by the sample generating distributions $(P_1, P_2)$.

One should note that if $\mathcal{A}$ has a finite VC-dimension $d$, then by Sauer's Lemma, $\Pi_{\mathcal{A}}(n) < n^d$ for all $n$. We thus have bounds on the probabilities of both missed detections and false alarms of our change detection tests.

The rate of growth of the needed sample sizes as a function of the sensitivity of the test can be further improved by using the relativized discrepancy statistics. We can get results similar to Theorem 3.1 for the distance measures $\phi_{\mathcal{A}}(P_1, P_2)$ and $\Xi_{\mathcal{A}}(P_1, P_2)$. We start with the following consequence of a result of Anthony and Shawe-Taylor [2].

**Theorem 3.2.** Let $\mathcal{A}$ be a collection of subsets of a finite VC-dimension $d$. Let $S$ be a sample of size $n$, drawn i.i.d. by a probability distribution $P$ (over $X$). Then

$$P^n\bigl(\phi_{\mathcal{A}}(S, P) > \epsilon\bigr) \le (2n)^d e^{-n\epsilon^2/4}$$

(where $P^n$ is the $n$-th power of $P$, the probability that $P$ induces over the choice of samples).

Similarly, we obtain the following bound on the probability of false alarm for the $\phi_{\mathcal{A}}(S_1, S_2)$ test.

**Theorem 3.3.** Let $\mathcal{A}$ be a collection of subsets of a finite VC-dimension $d$. If $S_1$ and $S_2$ are samples of size $n$ each, drawn i.i.d. by the same distribution $P$ (over $X$), then

$$P^{2n}\bigl(\phi_{\mathcal{A}}(S_1, S_2) > \epsilon\bigr) \le (2n)^d e^{-n\epsilon^2/4}$$

(where $P^{2n}$ is the $2n$-th power of $P$, the probability that $P$ induces over the choice of samples).
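The bound of Theorem 3.1 can be turned into a rough sample-size calculation. The sketch below is an illustration under stated assumptions: it replaces the growth function at 2m by its Sauer-type upper bound (2m)^d, the function names are hypothetical, and the search uses the fact that this expression is decreasing in m once m is at least 4d/ε².

```python
from math import ceil, exp


def vc_bound(m, eps, d):
    """Right-hand side of Theorem 3.1 with the growth function bounded by (2m)**d."""
    return 4 * (2 * m) ** d * exp(-m * eps ** 2 / 4)


def samples_needed(eps, delta, d):
    """Smallest window size m for which vc_bound(m, eps, d) <= delta (0 < delta < 1)."""
    lo = max(1, ceil(4 * d / eps ** 2))  # the bound is decreasing in m from here on
    hi = lo
    while vc_bound(hi, eps, d) > delta:  # double until the bound is small enough
        hi *= 2
    while lo < hi:                       # bisect on the decreasing tail
        mid = (lo + hi) // 2
        if vc_bound(mid, eps, d) <= delta:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Intervals on the line (VC-dimension 2), accuracy 0.1, failure probability 0.05.
m = samples_needed(eps=0.1, delta=0.05, d=2)
```

Bounds of this type are typically loose, so the resulting `m` is better read as a conservative guide to how window sizes scale with the accuracy and the VC-dimension than as a tuned setting.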

### Detecting Change in Data Streams

Detecting Change in Data Streams Shai Ben-David, Johannes Gehrke, Daniel Kifer Cornell University Abstract Detecting changes in a data stream is an important area of research with many applications. In


### Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

### Statistical Learning Theory Meets Big Data

Statistical Learning Theory Meets Big Data Randomized algorithms for frequent itemsets Eli Upfal Brown University Data, data, data In God we trust, all others (must) bring data Prof. W.E. Deming, Statistician,

### Non-Inferiority Tests for One Mean

Chapter 45 Non-Inferiority ests for One Mean Introduction his module computes power and sample size for non-inferiority tests in one-sample designs in which the outcome is distributed as a normal random

### BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### Stationary random graphs on Z with prescribed iid degrees and finite mean connections

Stationary random graphs on Z with prescribed iid degrees and finite mean connections Maria Deijfen Johan Jonasson February 2006 Abstract Let F be a probability distribution with support on the non-negative

### How to build a probability-free casino

How to build a probability-free casino Adam Chalcraft CCR La Jolla dachalc@ccrwest.org Chris Freiling Cal State San Bernardino cfreilin@csusb.edu Randall Dougherty CCR La Jolla rdough@ccrwest.org Jason

### Load Balancing and Switch Scheduling

EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three

### ORDERS OF ELEMENTS IN A GROUP

ORDERS OF ELEMENTS IN A GROUP KEITH CONRAD 1. Introduction Let G be a group and g G. We say g has finite order if g n = e for some positive integer n. For example, 1 and i have finite order in C, since

### Prime Numbers. Chapter Primes and Composites

Chapter 2 Prime Numbers The term factoring or factorization refers to the process of expressing an integer as the product of two or more integers in a nontrivial way, e.g., 42 = 6 7. Prime numbers are

### CHAPTER 3 Numbers and Numeral Systems

CHAPTER 3 Numbers and Numeral Systems Numbers play an important role in almost all areas of mathematics, not least in calculus. Virtually all calculus books contain a thorough description of the natural,

### CHAPTER 3. Sequences. 1. Basic Properties

CHAPTER 3 Sequences We begin our study of analysis with sequences. There are several reasons for starting here. First, sequences are the simplest way to introduce limits, the central idea of calculus.

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

### Globally Optimal Crowdsourcing Quality Management

Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University akashds@stanford.edu Aditya G. Parameswaran University of Illinois (UIUC) adityagp@illinois.edu Jennifer Widom Stanford

### Lecture 6 Online and streaming algorithms for clustering

CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

### NEW MEXICO Grade 6 MATHEMATICS STANDARDS

PROCESS STANDARDS To help New Mexico students achieve the Content Standards enumerated below, teachers are encouraged to base instruction on the following Process Standards: Problem Solving Build new mathematical

### Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

### Lecture 1: Elementary Number Theory

Lecture 1: Elementary Number Theory The integers are the simplest and most fundamental objects in discrete mathematics. All calculations by computers are based on the arithmetical operations with integers

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### Permutation Tests for Comparing Two Populations

Permutation Tests for Comparing Two Populations Ferry Butar Butar, Ph.D. Jae-Wan Park Abstract Permutation tests for comparing two populations could be widely used in practice because of flexibility of

### WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? introduction Many students seem to have trouble with the notion of a mathematical proof. People that come to a course like Math 216, who certainly

### Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Exploratory Data Analysis

Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

### CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are

### Can you design an algorithm that searches for the maximum value in the list?

The Science of Computing I Lesson 2: Searching and Sorting Living with Cyber Pillar: Algorithms Searching One of the most common tasks that is undertaken in Computer Science is the task of searching. We

### Towards a Reliable Statistical Oracle and its Applications

Towards a Reliable Statistical Oracle and its Applications Johannes Mayer Abteilung Angewandte Informationsverarbeitung Universität Ulm mayer@mathematik.uni-ulm.de Abstract It is shown how based on the

### Competitive Analysis of On line Randomized Call Control in Cellular Networks

Competitive Analysis of On line Randomized Call Control in Cellular Networks Ioannis Caragiannis Christos Kaklamanis Evi Papaioannou Abstract In this paper we address an important communication issue arising

### Supplement to Call Centers with Delay Information: Models and Insights

Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290

### It is time to prove some theorems. There are various strategies for doing

CHAPTER 4 Direct Proof It is time to prove some theorems. There are various strategies for doing this; we now examine the most straightforward approach, a technique called direct proof. As we begin, it

### Stochastic Inventory Control

Chapter 3 Stochastic Inventory Control 1 In this chapter, we consider in much greater details certain dynamic inventory control problems of the type already encountered in section 1.3. In addition to the

### A Game Theoretical Framework for Adversarial Learning

A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### Sequences and Convergence in Metric Spaces

Sequences and Convergence in Metric Spaces Definition: A sequence in a set X (a sequence of elements of X) is a function s : N X. We usually denote s(n) by s n, called the n-th term of s, and write {s

### How Useful Is Old Information?

6 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 1, JANUARY 2000 How Useful Is Old Information? Michael Mitzenmacher AbstractÐWe consider the problem of load balancing in dynamic distributed

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Introduction to Flocking {Stochastic Matrices}

Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

### 1. R In this and the next section we are going to study the properties of sequences of real numbers.

+a 1. R In this and the next section we are going to study the properties of sequences of real numbers. Definition 1.1. (Sequence) A sequence is a function with domain N. Example 1.2. A sequence of real

### ANALYZING NETWORK TRAFFIC FOR MALICIOUS ACTIVITY

CANADIAN APPLIED MATHEMATICS QUARTERLY Volume 12, Number 4, Winter 2004 ANALYZING NETWORK TRAFFIC FOR MALICIOUS ACTIVITY SURREY KIM, 1 SONG LI, 2 HONGWEI LONG 3 AND RANDALL PYKE Based on work carried out

### Faster deterministic integer factorisation

David Harvey (joint work with Edgar Costa, NYU) University of New South Wales 25th October 2011 The obvious mathematical breakthrough would be the development of an easy way to factor large prime numbers

### Guessing Game: NP-Complete?

Guessing Game: NP-Complete? 1. LONGEST-PATH: Given a graph G = (V, E), does there exists a simple path of length at least k edges? YES 2. SHORTEST-PATH: Given a graph G = (V, E), does there exists a simple

### The Trip Scheduling Problem

The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems

### Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

### Monte Carlo testing with Big Data

Monte Carlo testing with Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from:

### CONTRIBUTIONS TO ZERO SUM PROBLEMS

CONTRIBUTIONS TO ZERO SUM PROBLEMS S. D. ADHIKARI, Y. G. CHEN, J. B. FRIEDLANDER, S. V. KONYAGIN AND F. PAPPALARDI Abstract. A prototype of zero sum theorems, the well known theorem of Erdős, Ginzburg

### Factoring & Primality

Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

### Analysis of Representations for Domain Adaptation

Analysis of Representations for Domain Adaptation Shai Ben-David School of Computer Science University of Waterloo shai@cs.uwaterloo.ca John Blitzer, Koby Crammer, and Fernando Pereira Department of Computer

### Introduction to Statistical Machine Learning

CHAPTER Introduction to Statistical Machine Learning We start with a gentle introduction to statistical machine learning. Readers familiar with machine learning may wish to skip directly to Section 2,

### THE CENTRAL LIMIT THEOREM TORONTO

THE CENTRAL LIMIT THEOREM DANIEL RÜDT UNIVERSITY OF TORONTO MARCH, 2010 Contents 1 Introduction 1 2 Mathematical Background 3 3 The Central Limit Theorem 4 4 Examples 4 4.1 Roulette......................................

### Multimedia Communications. Huffman Coding

Multimedia Communications Huffman Coding Optimal codes Suppose that a i -> w i C + is an encoding scheme for a source alphabet A={a 1,, a N }. Suppose that the source letter a 1,, a N occur with relative

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.

### x if x 0, x if x < 0.

Chapter 3 Sequences In this chapter, we discuss sequences. We say what it means for a sequence to converge, and define the limit of a convergent sequence. We begin with some preliminary results about the

Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

### Nonparametric Statistics

1 14.1 Using the Binomial Table Nonparametric Statistics In this chapter, we will survey several methods of inference from Nonparametric Statistics. These methods will introduce us to several new tables

### Measuring the Performance of an Agent

25 Measuring the Performance of an Agent The rational agent that we are aiming at should be successful in the task it is performing To assess the success we need to have a performance measure What is rational

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

### Network (Tree) Topology Inference Based on Prüfer Sequence

Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 vanniarajanc@hcl.in,

### princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 6: Provable Approximation via Linear Programming Lecturer: Sanjeev Arora Scribe: One of the running themes in this course is the notion of

### Data Mining and Database Systems: Where is the Intersection?

Data Mining and Database Systems: Where is the Intersection? Surajit Chaudhuri Microsoft Research Email: surajitc@microsoft.com 1 Introduction The promise of decision support systems is to exploit enterprise

### 6 Scalar, Stochastic, Discrete Dynamic Systems

47 6 Scalar, Stochastic, Discrete Dynamic Systems Consider modeling a population of sand-hill cranes in year n by the first-order, deterministic recurrence equation y(n + 1) = Ry(n) where R = 1 + r = 1

### Online and Offline Selling in Limit Order Markets

Online and Offline Selling in Limit Order Markets Kevin L. Chang 1 and Aaron Johnson 2 1 Yahoo Inc. klchang@yahoo-inc.com 2 Yale University ajohnson@cs.yale.edu Abstract. Completely automated electronic

### BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s