Big Data & Scripting, Part II: Streaming Algorithms
a note on sampling and filtering
- sampling: (randomly) choose a representative subset
- filtering: given some criterion (e.g. membership in a set), retain only elements matching that criterion
- example scenario: a stream of requests (user, request)
  - sampling requests is straightforward (e.g. which pages are accessed most frequently)
  - analyzing the distribution of frequencies is more complicated: we want to know how many queries are repeated x times (for all x)
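A minimal sketch of the two operations on a (user, request) stream may help; the stream, the sample rate, and the allowed-user set below are made-up illustration values.

```python
# Sketch contrasting sampling and filtering on a stream of (user, request)
# pairs; all data and parameters are illustrative, not from the lecture.
import random

stream = [("u%d" % random.randint(1, 5), "page%d" % random.randint(1, 3))
          for _ in range(1000)]

# sampling: keep each event with probability 0.1 (representative subset)
sample = [event for event in stream if random.random() < 0.1]

# filtering: keep only events matching a criterion (membership in a set)
allowed_users = {"u1", "u2"}
filtered = [event for event in stream if event[0] in allowed_users]

print(len(sample), len(filtered))
```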
sampling and filtering: example
- n = 200,000 events, m = 40,000 different requests, uniform distribution
- [figure: number of occurrences per request id, for all queries and for a 10% sample]
sampling and filtering: example
- same dataset, but plotted as frequency vs. number of queries with this frequency
- [figure: all queries vs. a 10% sample of events, by number of queries with a given frequency]
- completely different distributions, due to sampling
sampling and filtering: example
- same dataset, again frequency vs. number of queries with this frequency
- this time the sample is selected by a fixed subset of ids
- [figure: all queries vs. the corrected 10% sample, by number of queries with a given frequency]
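The difference between the two samples can be reproduced with a short sketch: sampling individual events distorts the repetition-count histogram, while sampling a fixed subset of ids roughly preserves its shape. The data generator below is a stand-in using the slide's n and m, not the original dataset.

```python
# Sketch of the two sampling strategies from the example above.
import random
from collections import Counter

n, m = 200_000, 40_000
stream = [random.randrange(m) for _ in range(n)]   # request ids, uniform

def repetition_histogram(events):
    """How many ids occur exactly x times, for each x."""
    per_id = Counter(events)            # id -> number of occurrences
    return Counter(per_id.values())     # occurrences -> number of ids

# (a) naive 10% sample of events: distorts the repetition counts
event_sample = [e for e in stream if random.random() < 0.1]

# (b) corrected 10% sample: keep all events of a fixed 10% subset of ids
kept_ids = {i for i in range(m) if random.random() < 0.1}
id_sample = [e for e in stream if e in kept_ids]

print(repetition_histogram(stream))
print(repetition_histogram(event_sample))   # shifted towards small counts
print(repetition_histogram(id_sample))      # shape close to the full stream
```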
Histograms and Frequency Skews
stream and histogram
- consider the following input: [figure: data points arriving over time, one row per object/bucket 1-6]
- as time/the stream progresses, data points come in, e.g. users issuing requests
- data points are distinguished by some id or bucket (e.g. from hashing)
- some are seen more often (e.g. object 4), some less often (e.g. object 1): user 4 sends requests with high frequency, user 1 sends only one request
- this is highly valuable information for an analysis
stream and histogram
- [figure: the same stream and the resulting histogram of observation counts per object]
- to analyze these distributions, histograms are helpful
comparing histograms: different distributions
- an example of two different streams of observations [figure: the two histograms]
- both have an equal number of data points (10,000) and of distinct objects (60)
- but the objects have different probabilities of being observed
- sorting objects by frequency makes the difference more obvious [figure: the same histograms, sorted by frequency]
the plan
- information about the distribution of observations is crucial for many applications
- knowing the complete, exact histogram would be helpful, but is often not possible due to the large number of distinct objects
- workaround: characterize the histogram without knowing the complete picture
- characteristic properties are easier to determine, analogous to descriptions of distributions on $\mathbb{R}$
characterizing distributions
- [figure: histogram of observation counts per object]
- $m_i$: frequency of object $i$
- number of distinct objects seen so far: $\sum_i (m_i)^0$
- total number of objects seen so far: $\sum_i (m_i)^1 = \sum_i m_i$
- generalization: $M_k = \sum_i (m_i)^k$, the $k$-th moment
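For a stream small enough to keep every distinct object in memory, these moments can be computed directly from the frequency histogram; a minimal sketch with a made-up stream:

```python
# Exact moments of a small stream's frequency histogram (only feasible
# when all distinct objects fit in memory); the stream is made up.
from collections import Counter

stream = list("ababcbbba")
m = Counter(stream)                       # m[i] = frequency of object i

M0 = sum(c ** 0 for c in m.values())      # number of distinct objects
M1 = sum(c ** 1 for c in m.values())      # total number of observations
M2 = sum(c ** 2 for c in m.values())      # second moment, measures skew

print(M0, M1, M2)                         # 3 9 35
```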
$M_2$, the second moment
- what we have so far:
  - $M_0$: the Flajolet-Martin algorithm from the last lecture
  - $M_1$: counting
  - combination: the average $M_1 / M_0$
- next: estimate $M_2 = \sum_i m_i^2$
$M_2$, the second moment
- [figure: the two example histograms, with $M_2 = 1{,}678{,}672$ and $M_2 = 3{,}320{,}852$]
- motivation:
  - $M_2$ describes the skewness of a distribution: smaller $M_2$, less skewed distribution
  - related to the Gini index (surprise index)
  - used to limit approximation errors and for query optimization in database systems
$M_2$ and Var(X)
- variance describes the distribution of values, $M_2$ describes the distribution of their frequencies
- $M_2$ is comparable to the variance of the frequencies: $\mathrm{Var}(\{m_i\}) = \frac{1}{N} \sum_i (m_i - \mu(\{m_i\}))^2$
$M_2$, the second moment: approximation
- storing and counting all distinct objects is impossible
- approximation by the Alon-Matias-Szegedy algorithm [1]:
  - $N$ observations in the stream
  - choose $k$ random positions $p_j \in \{1, \dots, N\}$
  - when reaching position $p_j$: store the object at that position and start counting its occurrences in $m_j$
  - estimate: $M_2 \approx \frac{N}{k} \sum_{j=1}^{k} (2 m_j - 1)$
[1] Alon, N.; Matias, Y.; Szegedy, M.: The space complexity of approximating the frequency moments, 1999
$M_2$, the second moment: example
- stream (positions 1-20): c e c f a e g f f b b c g b a a f d a e, so $N = 20$
- random positions: 3, 7, 14, 5
  - position 3: encounter c, counting from there gives 2
  - position 7: encounter g, 2
  - position 14: b, 1
  - position 5: a, 4
- estimate: $M_2 \approx \frac{20}{4}\left[(2 \cdot 2 - 1) + (2 \cdot 2 - 1) + (2 \cdot 1 - 1) + (2 \cdot 4 - 1)\right] = \frac{20}{4} \cdot 14 = 70$
- true value: $M_2 = 4^2 + 3^2 + 3^2 + 1^2 + 3^2 + 4^2 + 2^2 = 64$
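The estimator and the worked example can be reproduced in a few lines; the sketch below runs offline over the full stream, so counting from each sampled position is simply a scan to the end (stream and positions are the ones above).

```python
# Sketch of the Alon-Matias-Szegedy M2 estimator on the example stream.
def ams_estimate(stream, positions):
    N = len(stream)
    counts = []
    for p in positions:                          # 1-indexed positions
        obj = stream[p - 1]
        # count occurrences of obj from position p to the end of the stream
        counts.append(sum(1 for x in stream[p - 1:] if x == obj))
    k = len(positions)
    return (N / k) * sum(2 * c - 1 for c in counts)

stream = list("cecfaegffbbcgbaafdae")
print(ams_estimate(stream, [3, 7, 14, 5]))       # 70.0, true M2 is 64
```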
$M_2$, the second moment: summary
- the algorithm is simple to implement and needs to store only the $k$ counters
- it gets more precise with larger $k$; proof idea: the expected value of each term $N(2 m_j - 1)$ is exactly $M_2$, so the average over $k$ counters approaches $M_2$
- problem: $N$ may not be known in the beginning
approximating $M_2$ with unknown stream length
- the stream may be of unknown length or unlimited
- still, each position must be chosen uniformly at random from $\{1, \dots, N\}$
- solution (reservoir sampling):
  - keep counters for $k$ objects, beginning with the first $k$
  - when the object at position $p > k$ is processed: choose it with probability $k/p$ and drop an existing element (chosen with equal probability)
  - as a result, each position is chosen with equal probability
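A single-pass sketch that combines the AMS counters with this reservoir-sampling rule, so no stream length is needed in advance; the function name, the choice of k and the test stream are illustrative, not part of the original algorithm description.

```python
# Streaming AMS estimate of M2 with unknown stream length.
import random

def ams_streaming(stream, k):
    reservoir = []            # entries: [object, count since it was sampled]
    n = 0                     # number of stream positions processed so far
    for x in stream:
        n += 1
        # update counters of already-sampled objects (including this occurrence)
        for entry in reservoir:
            if entry[0] == x:
                entry[1] += 1
        if len(reservoir) < k:
            reservoir.append([x, 1])
        elif random.random() < k / n:
            # replace a uniformly chosen existing counter with the new object
            reservoir[random.randrange(k)] = [x, 1]
    return (n / k) * sum(2 * c - 1 for _, c in reservoir)

print(ams_streaming(list("cecfaegffbbcgbaafdae"), 4))  # noisy estimate of 64
```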
Clustering Data Streams
clustering data streams: the problem
- many formulations of the clustering problem are possible: a wide range of applications, with strong variation in preconditions and objective functions
- common ground: objects connected by a relation; identify groups of similar objects with respect to this relation
- the problem is intractable (NP-hard)
- some basic questions:
  - what kind of relation (e.g. binary, distance, similarity)?
  - can objects have a mean value (continuous space)?
  - what is a good cluster (objective function)?
  - are overlapping clusters possible?
clustering data streams: STREAM
- in the following: a single example problem and a single algorithm
- k-median on a data stream, in one pass, with guaranteed approximation quality
- algorithm: STREAM (Guha, Mishra, Motwani, O'Callaghan: Clustering Data Streams, 2000)
clustering data streams: the k-median problem
- input:
  - objects $X = \{x_i : i = 1, \dots, N\}$, distance $d : X \times X \to \mathbb{R}$
  - every $x_i$ is seen once, in arbitrary order ($i = 1, \dots, N$)
  - $k$: the number of clusters to find
- objective: identify $k$ elements $m_1, \dots, m_k \in X$ (cluster centers)
  - let $N(m_j) = \{x_i \in X : j = \arg\min_{l \in \{1,\dots,k\}} d(x_i, m_l)\}$, i.e. all $x_i$ for which $m_j$ is the nearest center
  - minimize $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} d(x_i, m_j)$
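The objective is easy to state in code: every object contributes its distance to the nearest chosen center. The toy instance below (points on the real line, absolute difference as distance) is a made-up example.

```python
# Minimal sketch of the k-median objective C({m_1,...,m_k}) for given centers.
def kmedian_cost(objects, centers, d):
    # each object contributes its distance to the nearest center
    return sum(min(d(x, m) for m in centers) for x in objects)

objects = [0.0, 1.0, 2.0, 10.0, 11.0]
d = lambda a, b: abs(a - b)
print(kmedian_cost(objects, [1.0, 10.0], d))   # 1 + 0 + 1 + 0 + 1 = 3
```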
clustering data streams: approximating k-median
- for small problem instances, k-median can be approximated to a fixed (constant) factor: $C_{\text{approx}} \le a \cdot C_{\text{opt}}$ (the approximation is at most a factor $a$ worse than the optimal solution, for a fixed $a$)
- this approximation is useful for approximating larger instances
- approximation (idea):
  - k-median can be stated as an integer program $P_I$
  - this program can be relaxed to a linear program $P_L$
  - the solution of $P_L$ can be rounded to a solution of $P_I$
  - linear programs can be solved efficiently
clustering data streams: weighted k-median
- extending k-median with weights: k-median with weighted samples, $w : X \to \mathbb{R}_{>0}$
- the distance of objects to their centers is multiplied by their weight: $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} w(x_i) \, d(x_i, m_j)$
- k-median is the special case with unit weights
- weighted k-median can be approximated like unweighted k-median: the algorithm can only be applied to small instances, so we use it to solve small sub-problems
- in the following, we use the procedure wkm(): input: objects, weights, k; output: k weighted centers; runtime: $O(n^2)$ (see the sketch below)
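As a stand-in for wkm(), here is a brute-force weighted k-median that simply tries every k-subset of the objects as centers; it is only usable for very small instances (and is not the $O(n^2)$ procedure referenced above), and its name, interface and test data are assumptions for illustration.

```python
# Brute-force stand-in for the wkm() procedure described above.
from itertools import combinations

def wkm(objects, weights, k, d):
    def cost(centers):
        return sum(w * min(d(x, m) for m in centers)
                   for x, w in zip(objects, weights))
    # try every k-subset of the objects as candidate centers
    return min(combinations(objects, k), key=cost)

objects = [0.0, 1.0, 2.0, 10.0, 11.0]
weights = [1, 1, 5, 1, 1]                 # heavy weight pulls a center to 2.0
print(wkm(objects, weights, 2, lambda a, b: abs(a - b)))   # (2.0, 10.0)
```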
first step: clustering with low memory
- approach: divide and conquer
- Small-Space(X):
  1. divide X into $l$ disjoint subsets $X_1, \dots, X_l$
  2. cluster each $X_i$ individually into $k$ clusters
  3. result: $X'$, the set of the $lk$ cluster centers
  4. cluster $X'$, using for each $c \in X'$ the weight $|N(c)|$ (the number of points assigned to $c$)
- step 2 can be solved with a constant-factor approximation: the solution is at most $b$ times worse than the optimum
- step 4 can be solved with a constant-factor approximation, not worse than $c$ times the optimum
- result: a constant-factor approximation built from partial solutions and their combination
extending to a solution
- Small-Space(X):
  1. divide X into $l$ disjoint subsets $X_1, \dots, X_l$
  2. cluster each $X_i$ individually into $O(k)$ clusters
  3. result: $X'$, a set of $O(lk)$ cluster centers
  4. cluster $X'$, using for each $c \in X'$ the weight $|N(c)|$
- this yields a constant-factor approximation, but:
  - it needs to cluster each $X_i$: memory problem 1, the size of the subsets versus $l$
  - it needs to cluster $X'$: memory problem 2, clustering $O(lk)$ elements
- a sketch of the whole procedure follows below
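A self-contained sketch of Small-Space, reusing the brute-force wkm stand-in from above for both the sub-problems and the final clustering of X'; the chunking scheme and the toy data are illustrative choices, not prescribed by the algorithm.

```python
# Sketch of the Small-Space divide-and-conquer step with a brute-force
# weighted k-median for the small sub-problems.
from itertools import combinations
from collections import Counter

def wkm(objects, weights, k, d):
    cost = lambda C: sum(w * min(d(x, m) for m in C)
                         for x, w in zip(objects, weights))
    return min(combinations(objects, k), key=cost)

def small_space(X, k, l, d):
    chunks = [X[i::l] for i in range(l)]                  # 1. l disjoint subsets
    centers, weights = [], []
    for chunk in chunks:                                  # 2. cluster each X_i
        local = wkm(chunk, [1] * len(chunk), k, d)
        assign = Counter(min(local, key=lambda m: d(x, m)) for x in chunk)
        centers += list(local)                            # 3. collect the lk centers
        weights += [assign[m] for m in local]             # |N(c)| as weight
    return wkm(centers, weights, k, d)                    # 4. cluster X'

X = [0.0, 0.5, 1.0, 1.5, 9.0, 9.5, 10.0, 10.5]
print(small_space(X, k=2, l=2, d=lambda a, b: abs(a - b)))
```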