Big Data & Scripting Part II Streaming Algorithms

Size: px

Start display at page:

Download "Big Data & Scripting Part II Streaming Algorithms"

Grant Morgan
10 years ago
Views:

1 Big Data & Scripting Part II Streaming Algorithms 1,

2 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set), retain only elements matching that criterion example scenario: stream of requests (user,request) sampling requests is straightforward (e.g. which pages are accessed most frequently) analyzing the distribution of frequencies is more complicated that is, we want to know, how many queries are repeated x times (for all x)

membership in a set), retain only elements matching that criterion example scenario: stream of requests

3 3, sampling and filtering example n = 200, 000 events, m = 40, 000 different requests, uniform distribution all queries % sample s id

uniform distribution all queries 0 5 10 15 0 1

4 sampling and filtering example same dataset, but vs. # queries with this all queries by number of queries with number of queries with % sample by completely different distributions due to sampling 4,

4000 6000 0 5 10 15 number of queries with 0 5000 15000 25000 10%

5 5, sampling and filtering example same dataset, but vs. # queries with this this time sample is selected by a fixed subset of ids all queries by number of queries with corrected 10% sample by number of queries with

of ids all queries by number of queries with 0 2000 4000 6000 0 5

6 Histograms and Frequency Skews 6,

7 7, stream and histogram consider the following input: objects/buckets time as time/stream progresses, data points come in e.g. users issue requests distinguished by some id or bucket (from hashing) some are seen more often (e.g. 4) some less often (e.g. 1) e.g. user 4 sending requests with high, user 1 only one request this is highly valuable information for an analysis

8 8, stream and histogram objects/buckets time to analyze these distributions, histograms are helpful: object

9 9, comparing histograms - different distributions an example of two different streams of observations: objects objects both have equal number of data points (10.000) and distinct objects (60) but objects have different probabilities to be observed sorting objects by frequencies makes the difference more obvious: objects objects

000) and distinct objects (60) but objects have different probabilities to be observed sorting objects by

10 10, the plan information about the distribution of observation is crucial for many applications knowing the complete, exact histogram would be helpful is often not possible, due to the large number of distinct objects workaround: characterize histogram without knowing the complete picture characteristic properties easier to determine analogous to descriptions of distributions on R

number of distinct objects workaround: characterize histogram without knowing the complete

11 11, characterizing distributions object m i : of object i number of distinct objects seen so far: i(m i ) 0 total number of objects seen so far: i(m i ) 1 = i m i generalization: M k = i(m i ) k kth moment

12 12, M 2 the second moment what we have so far M 0 Flajolet-Martin algorithm from last lecture M 1 counting combination: average M 1 /M 0 next: estimate M 2 = i m 2 i

13 13, M 2 the second moment objects M 2 = objects M 2 = Motivation M 2 describes the skewness of a distribution smaller M 2 less skewed distribution related to the Gini-Index (surprise index) used to limit approximation errors, query optimization in database systems

852 Motivation M 2 describes the skewness of a distribution smaller M 2 less skewed

14 14, M 2 and Var(X) variance describes the distribution of values M 2 describes the distribution of their frequencies M 2 comparable to variance of frequencies: Var({m i }) = 1/N i(m i µ({m i })) 2

15 15, M 2 the second moment: approximation storing and counting distinct objects impossible approximation by Alon-Matias-Szegedy algorithm 1 : algorithm N observations in stream choose k random positions p j {1,..., N} when reaching position p j : store object at position start counting occurrences of this object in m j estimate: M 2 n/k( k i=1 (2m i 1)) 1 Alon, N.; Matias Y.; Szegedy, M.: The space complexity of approximating the moments, 1999

.., N} when reaching position p j : store object at position start counting occurrences of this object in m j

16 16, M 2 the second moment: example c e c f a e g f f b b c g b a a f d a e N=20 random positions 3, 7, 14, 5 position 3: encounter c, counting results in 2 position 7: encounter g, 2 position 14: b 1 position 5 a 4 estimate: M 2 20[2 (2 2 1) + (2 1 1) + (2 4 1] = = true value: M 2 = = 64

results in 2 position 7: encounter g, 2 position 14: b 1 position 5 a 4 estimate: M 2 20[2 (2 2

17 17, M 2 the second moment: summary the algorithm is simple to implement needs to store only the k counters gets more precise with larger k, proof idea: expected value of each counter is fraction of M 2 average of k counters approaches M 2 problem: N may not be known in the beginning

$larger k, proof idea: expected value of each counter is fraction of M$

18 18, approximating M 2 with unknown stream length stream may be of unknown length or unlimited still each position must be chosen random and uniform from {1,..., N} solution keep count of k objects beginning with the first k when object at position p > k is processed: choose with probability k/(p + 1) drop existing element (chosen with equal probability) each position chosen with equal probability

.., N} solution keep count of k objects beginning with the first k when object at position p > k is

19 clustering data streams 19,

20 20, clustering data streams the problem many formulations of the clustering problem possible wide application ranges, strong variance in preconditions objective function common ground: objects connected by relation identify groups of similar objects with respect to relation problem is intractable (N P-hard) some basic questions what kind of relation (e.g. binary, distance, similarity) can objects have a mean value (continuous space) what is a good cluster (objective function) possibility of overlapping clusters

with respect to relation problem is intractable (N P-hard) some basic questions what kind of relation (e.g.

21 21, clustering data streams STREAM in the following: a single example problem and a single algorithm k-median on a data stream in one pass with guaranteed approximation quality algorithm: STREAM Guha, Mishra,Motwani, O Callaghan: Clustering Data Streams,2000

22 22, clustering data streams the k-median problem input: objects X = {x i : i = 1,..., N} distance d : X X R every x i is seen once in arbitrary order (i = 1,..., N) k - number of clusters to find objective: identify k elements m 1,..., m k X (cluster centers) let N(m j ) = {x i X : j = arg min l 1,...,k d(x i, m l )} all x i for which m i is the nearest center minimize C({m 1,..., m k }) = k j=1 x i N(m j ) d(x i, m j )

23 23, clustering data streams approximating k-median for small problem instances k-median can be fixed parameter approximated fixed parameter approximation: C approx a Q opt (approximation is maximal by factor a worse than optimal solution for fixed a) this approximation is useful to approximate larger instances approximation (idea) k-medians can be stated as integer program P I this program can be relaxed to a linear program P L solution of P L can be rounded to solution of P I linear problems can be solved efficiently

24 clustering data streams weighted k-medians extending k-medians with weights: k-medians with weighted samples w : X R >0 : distance of objects to their centers multiplied by weight: C({m 1,..., m k }) = j i 1,...,N w(x i ) d(x i, m j ) k-medians is special case with unit weights weighted k-means can be approximated similar to k-means: algorithm can only be applied to small instances use it to solve small sub-problems in the following, use procedure: wkm() input: objects, weights, k output: k weighted centers runtime: O(n 2 ) 24,

25 25, first step - clustering with low memory approach: divide and conquer Small-Space(X) 1. divide X into l disjoint subsets X 1,..., X l 2. cluster each X i individually into l k clusters 3. result: X set of lk cluster centers 4. cluster X, using for each c X N(c) as weight 2. can be solved with a constant factor approximation: solution b times worse than optimum 4. can be solved with constant factor approximation not worse than c times optimum result: constant factor approximation partial solutions and their combination

26 26, extending to a solution Small-Space(X) 1. divide X into l disjoint subsets X 1,..., X l 2. cluster each X i individually into O(k) clusters 3. result: X set of O(lk) cluster centers 4. cluster X, using for each c X N(c) as weight constant factor approximation needs to cluster X i memory problem 1: size of subsets versus l needs to cluster X memory problem 2: clustering O(lk) elements

Big Data & Scripting Part II Streaming Algorithms

Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set