Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Transcription

1 Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven)

2 Goal: Look at big data through the lens of theoretical CS. Theory of algorithms: past, present and future? Some cool ideas. Some work we have done.

3 Algorithm design. Algorithm: a set of steps to solve a problem (by a computer). Studied since the 1950s. Given a problem: find (i) the best solution (ii) quickly. Traveling salesman problem (TSP): n! possibilities ( for n=60). Ideally: polynomial running time ($n^2$, $n^3$). 33 cities, 1962 competition.

4 70s: Problems. Polynomial time ($n \log n$, $n^2$, $n^3$): e.g. shortest path, matching, max-flow, ... NP-hard: TSP, 3-SAT, most problems (brute force $2^n$ = only known option). Late 80s-now: coping with NP-hardness. Approximation algorithms: even if a problem is NP-hard, maybe a 95%-optimal solution can be found in polynomial time? (very rich theory/connections)

5 3-SAT: $(x_1 \lor x_7 \lor x_{13}) \land (x_2 \lor x_1 \lor x_4)$ (maximize number of satisfied clauses). 7/8 approximation trivial: why?

6 3-SAT: $(x_1 \lor x_7 \lor x_{13}) \land (x_2 \lor x_1 \lor x_4)$ (maximize number of satisfied clauses). 7/8 approximation trivial: random assignment. PCP theorem (90s): complexity, coding, Fourier analysis, ... Better than 7/8 approximation implies P = NP. Good understanding, though many questions still open. [Figure: all possible 3-SAT instances, with the 7/8-hard instance marked]
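Why a random assignment achieves 7/8: a clause on three distinct variables is falsified by exactly one of the 8 settings of those variables, so a uniformly random assignment satisfies it with probability 7/8, and by linearity of expectation it satisfies 7/8 of all clauses on average. The Python sketch below checks this by exhaustive enumeration. The signed-integer clause encoding and the negations in the example instance are assumptions for illustration, since the literal signs on the slide did not survive transcription.

```python
from fractions import Fraction
from itertools import product

def satisfied(clause, assignment):
    """A clause is a tuple of signed ints: +v means x_v, -v means NOT x_v."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def expected_fraction_satisfied(clauses):
    """Exact average, over all assignments, of the fraction of satisfied clauses."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    total = Fraction(0)
    count = 0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        num_sat = sum(satisfied(c, assignment) for c in clauses)
        total += Fraction(num_sat, len(clauses))
        count += 1
    return total / count

# Hypothetical instance; the signs are made up for illustration.
clauses = [(1, -7, 13), (-2, 1, 4)]
print(expected_fraction_satisfied(clauses))  # 7/8
```

Enumeration is only feasible for a handful of variables; it is used here to confirm the expectation, not as an algorithm.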

7 Random inputs: very rich area. Assume input chosen from some nice distribution. 3-SAT: random clauses. TSP: points randomly chosen in the plane. [Figure: all possible 3-SAT instances, with the well-behaved random inputs and the 7/8-hard instance marked]

8 Future (1) Document Clustering New York Times database All possible instances Some ground truth: should make it easier Can encode 3-SAT Approaches: Semi-random models, HMM, smoothed analysis (just the beginning) Explain performance of heuristics: K-means; SVMs; Deep Learning

9 Future (2). The polynomial/non-polynomial view is too limited (even $n^2$ time is prohibitive for huge n). Google: figure out which documents are similar (various reasons: show diverse pages for a query); which ad to show on a click. Don't care about a perfect answer: pages change/disappear, no best answer anyway. Fine-grained view of polynomial time (weak understanding).

10 Needed: New ways of thinking Discard old beliefs Beautiful new ideas emerging

11 Rest of the talk A glimpse of some ideas 1) Streaming/ fast algorithms 2) Local Partitioning 3) Clustering via eigenvalues New ways of looking at the problem

12 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { }

13 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3 }

14 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4}

15 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4}

16 Counting distinct elements Input: Stream of numbers (say in range [1,n] ) Example: Goal: Compute number of distinct elements. Here 5 because we saw {3,4,2,17,11} Simple Solution: Just maintain a list of items seen thus far Stream: List: { 3, 4, 2}
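The simple solution on these slides can be sketched in a few lines of Python. The example stream is hypothetical (the stream shown on the slide was not transcribed); it was chosen so that its distinct elements are {3, 4, 2, 17, 11}. Note that the memory used grows with the number of distinct elements, which is what the following slides try to avoid.

```python
def count_distinct_exact(stream):
    """Exact count: remember every item seen so far."""
    seen = set()
    for item in stream:
        seen.add(item)  # adding an item twice has no effect
    return len(seen)

# Hypothetical stream whose distinct elements are {3, 4, 2, 17, 11}.
stream = [3, 4, 3, 2, 17, 4, 11, 2]
print(count_distinct_exact(stream))  # 5
```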

17 Counting Distinct elements Note: The algorithm tracks the numbers seen thus far. Question: What if it can remember only 1 number? (very limited memory) Trouble: Can barely remember anything about past. Stream: When you scan next element, no clue if already seen?

18 Seems completely hopeless? Intuition only partly right Cannot hope to count exactly. But who cares if answer is 3,425,269 or 3,425,587? Approximate counting possible!! Technique: Min-hashing (beautiful use of approximation and randomization)

19 Number of distinct elements. Basic idea [Flajolet-Martin 82]: use a random hash function (map), e.g. an encryption function, $h: [1,n] \to [1,n']$, say $n' \gg n$. Algorithm: keep track of $\min h(i)$. [Figure: the stream, with h(2) marked on the range from 1 to $n'$]

20 Number of distinct elements. Basic idea [Flajolet-Martin 82]: use a random hash function (map), e.g. an encryption function, $h: [1,n] \to [1,n']$, say $n' \gg n$. Algorithm: keep track of $\min h(i)$. [Figure: the stream, with h(8) marked on the range from 1 to $n'$]

21 Number of distinct elements. Basic idea [Flajolet-Martin 82]: use a random hash function (map), e.g. an encryption function, $h: [1,n] \to [1,n']$, say $n' \gg n$. Algorithm: keep track of $\min h(i)$. [Figure: the stream, with h(8) marked on the range from 1 to $n'$] Note: h(i) is the same every time i is encountered.

22 Number of distinct elements. Basic idea [Flajolet-Martin 82]: use a random hash function (map), e.g. an encryption function, $h: [1,n] \to [1,n']$, say $n' \gg n$. Algorithm: keep track of $\min h(i)$. [Figure: the stream, with h(8) marked on the range from 1 to $n'$] Note: h(i) is the same every time i is encountered.

23 Number of distinct elements. Keep track of $\min h(i)$, where h maps into a range of size $n'$. Suppose 1 distinct element (stream = ): $\min h(i) \approx n'/2$. [Figure: number line from 1 to $n'$]

24 Number of distinct elements. Keep track of $\min h(i)$. Suppose 2 distinct elements: $\min h(i) \approx n'/3$. [Figure: number line from 1 to $n'$] If k items seen, expect the min-value to be around $n'/(k+1)$. Solution: estimate of # elements = $(n' / \min h(i)) - 1$.
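Why the minimum sits near $n'/(k+1)$, writing $n'$ for the size of the hash range: a sketch of the calculation, idealizing the k hash values as independent uniform reals on $[0, n']$ (an assumption; the hash values on the slide are integers).

```latex
\[
\Pr[\min > t] = \left(1 - \frac{t}{n'}\right)^{k},
\qquad
E[\min] = \int_0^{n'} \left(1 - \frac{t}{n'}\right)^{k} \, dt = \frac{n'}{k+1}.
\]
```

Solving $\min \approx n'/(k+1)$ for k gives the estimate $k \approx n'/\min - 1$.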

25 Number of distinct elements. Randomness could mess things up. E.g. there may be just 1 element, but $\min h(i)$ could be far from the value we expect, $n'/2$. Standard trick: use O(1) such hash functions and take the median entry. Theorem: for any ε > 0, can estimate the number of distinct elements to within a 1 ± ε factor with high probability. Space = $O(1/\varepsilon^2)$.
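A minimal Python sketch of the min-hash estimator with the median trick. Several choices here are assumptions and not from the slides: a salted BLAKE2 digest stands in for the random hash functions, hash values are rescaled to (0, 1] (so the estimate becomes 1/min - 1), and 51 hash functions are used. A plain median like this only gives a constant-factor estimate, because the median of the minimum is about ln 2 / k, which pushes the estimate upward; the sketch does not attempt the more careful estimator needed for the 1 ± ε guarantee.

```python
import hashlib
from statistics import median

def unit_hash(item, salt):
    """Pseudo-random but consistent value in (0, 1] for the pair (salt, item)."""
    digest = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64

def estimate_distinct(stream, num_hashes=51):
    """One pass over the stream; stores one running minimum per hash function."""
    minima = [1.0] * num_hashes
    for item in stream:
        for salt in range(num_hashes):
            minima[salt] = min(minima[salt], unit_hash(item, salt))
    # Each hash function yields the estimate 1/min - 1; take the median.
    return median(1.0 / m - 1.0 for m in minima)

# Hypothetical stream: 1000 distinct values, each appearing 3 times.
stream = [i % 1000 for i in range(3000)]
print(round(estimate_distinct(stream)))
```

Repeated items never change the minima, so the estimate depends only on the set of distinct elements, not on how often each one occurs.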

26 A closer look. Random hash function h: we need the value h(i) to be the same every time we see i, so one has to store each h(i) somewhere. h(1), h(2), h(3), ..., h(n) need n log n space?? Did we just disguise our inherent problem? There is a fix! Key idea: do not need the full power of randomness.

27 What is randomness? Do not need full randomness. Pairwise independence: for any $a_1, a_2$ and any $x_1 \neq x_2$, $\Pr[h(x_1) = a_1 \text{ and } h(x_2) = a_2] = 1/(n')^2$, where $n'$ is the size of the range of h. Such an h is very simple to store: $h(x) = (ax + b) \bmod p$ [just need to store a and b].
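The pairwise-independence property of h(x) = (ax + b) mod p can be checked exhaustively for a small prime. In the Python sketch below, p = 7 and the inputs 2 and 5 are arbitrary illustrative choices, and a and b range over all of {0, ..., p-1}; the range of h then has size p, so the probability in question is 1/p^2.

```python
from collections import Counter
from itertools import product

def make_hash(a, b, p):
    """h(x) = (a*x + b) mod p; only a and b need to be stored."""
    return lambda x: (a * x + b) % p

def pair_distribution(x1, x2, p):
    """Count how often each pair (h(x1), h(x2)) occurs over all choices of (a, b)."""
    counts = Counter()
    for a, b in product(range(p), repeat=2):
        h = make_hash(a, b, p)
        counts[(h(x1), h(x2))] += 1
    return counts

p = 7  # small prime, chosen for illustration
counts = pair_distribution(2, 5, p)
# Each of the p*p target pairs is produced by exactly one (a, b),
# so Pr[h(x1) = a1 and h(x2) = a2] = 1/p^2 for x1 != x2.
print(len(counts) == p * p and set(counts.values()) == {1})  # True
```

The check works because, for x1 != x2 modulo a prime p, the map from (a, b) to (h(x1), h(x2)) is a bijection.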

28 Min-hashing: applications. Similarity of web pages (if mostly similar words). Google: page -> few min-hash values (few bits). Similar page detection: quadratic -> linear time. Sketching: complex -> simple. Tons of amazing applications (several researchers).
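One standard way to turn min-hash values into a similarity score, sketched in Python: for a random hash function, the probability that two word sets share the same minimum hash value equals their Jaccard similarity (size of intersection over size of union), so the fraction of matching entries in two short signatures estimates it. The slide does not spell out this estimator; the salted BLAKE2 hash, the signature length of 200, and the two example pages are assumptions for illustration.

```python
import hashlib

def hash_value(word, salt):
    """Consistent pseudo-random 64-bit value for the pair (salt, word)."""
    digest = hashlib.blake2b(f"{salt}:{word}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(words, num_hashes=200):
    """Sketch of a page: the minimum hash value under each hash function."""
    words = set(words)
    return [min(hash_value(w, salt) for w in words) for salt in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Hypothetical pages: 6 shared words out of 10 distinct, so Jaccard = 0.6.
page_a = "the quick brown fox jumps over the lazy dog".split()
page_b = "the quick brown fox leaps over the sleepy dog".split()
sig_a, sig_b = minhash_signature(page_a), minhash_signature(page_b)
print(estimated_similarity(sig_a, sig_b))
```

Because each page is reduced to a short signature, similar pages can be found by grouping on signature entries instead of comparing all pairs of full pages.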

29 The frequency moments. Stream of m numbers in range [1,n]. Ex: A= , $S(A) = (m_1, \ldots, m_n)$ summarizes the stream, where $m_i$ = # of occurrences of i in the stream A. How to compute $L_p$ norms of S(A) in 1 pass and O(log n) space? $L_0$ = # of distinct elements (non-zero entries). $L_1$ = length of stream, $m = \sum_i m_i$. $L_2$ = skewness, $\sum_i m_i^2$. Saw how to compute $L_0$ in O(log n) space.

30 Computing $L_2$ ($\sum_i m_i^2$). Notation: $m_i$ is # of occurrences of i. Algorithm (Tug of War) [Alon, Matias, Szegedy 96]: let h be a random hash function, $h(i) \in \{-1, +1\}$, each w/ prob. 1/2. h is random, yet consistent for a given i. Every time you see an element i, compute h(i). Track X = sum of hash values, $X = \sum_i m_i h(i)$. Output: estimate $Y = X^2$.

31 Pf: $E[Y] = E[X^2] = E[(\sum_i m_i h(i))^2]$ (expectation over hash fns) $= E[\sum_i m_i^2 h(i)^2 + \sum_{i \neq j} h(i) h(j) m_i m_j]$. Now, $E[h(i)^2] = 1$ (because $h(i) = \pm 1$) and $E[h(i) h(j)] = 0$ for $i \neq j$ (h(i) and h(j) independent). So $E[Y] = \sum_i m_i^2$. As usual, we take O(1) such estimates and average them. Can show: the estimator has low variance. 4-wise independent hash functions suffice.
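The tug-of-war estimator can be sketched in Python as follows. Assumptions not taken from the slides: a salted BLAKE2 digest stands in for the random sign hash (the slides note that 4-wise independence would suffice), 200 independent estimates are averaged, and the example stream is made up.

```python
import hashlib
from statistics import mean

def sign_hash(item, salt):
    """Consistent pseudo-random sign in {-1, +1} for the pair (salt, item)."""
    digest = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1

def estimate_l2(stream, num_estimates=200):
    """One pass; each counter X is the running sum of h(i) over the stream."""
    counters = [0] * num_estimates
    for item in stream:
        for salt in range(num_estimates):
            counters[salt] += sign_hash(item, salt)
    # Each X*X has expectation sum_i m_i^2; averaging reduces the variance.
    return mean(x * x for x in counters)

def exact_l2(stream):
    """Reference value: sum of squared occurrence counts."""
    counts = {}
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
    return sum(c * c for c in counts.values())

# Hypothetical skewed stream: three heavy items plus 50 singletons.
stream = [1] * 50 + [2] * 30 + [3] * 20 + list(range(4, 54))
print(exact_l2(stream), estimate_l2(stream))
```

Each counter is a single integer, so the memory used is independent of the number of distinct items in the stream.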

32 Approximate max-flow. Major breakthroughs in the last couple of years: (1+ε)-approximate max-flow in near-linear time. Decompose the graph into simpler structures (expander-like). Expanders: all cuts are well captured by simpler vertex cuts. Get the rough flow right; fix errors in subsequent iterations. [Figure: graph, expander, tree of expanders]

33 Local Partitioning (Light Networks, Philips) Project of Britt Mathijsen w/ Johan van Leeuwaarden + Bansal

34 Light Networks Wireless capability: Control, monitor, exchange performance data Segment controller for a region

35 Light Networks. Goal: partition the network into pieces, s.t. each piece has (i) good intra-connectivity (ii) roughly equal size (iii) small diameter (few hops) (iv) low failure probability. Impossible to approach via traditional algorithms. Idea: local partitioning algorithms are easy to tailor.

36 Local partitioning algorithms. Studied by Spielman-Teng and Andersen-Peres (inspired by big data). Find a well-connected piece containing node v. Time proportional to size of output piece. [Figure: graph with node v marked]

37 Local Algorithms

38 Local Algorithms

39 Local Algorithms

40 Local Algorithms

41 ASML Error log Problem (clustering with eigenvalues) Tom Slenders w/ Peter van den Hamer (ASML) + Bansal

42 Problem description Slide 42 3/31/2015

43 Problem description

44 Problem description. Site A, Site B: similar?

45 ASML has huge database of error logs Elaborate pattern-matching system; using domain knowledge For us: No clue what these errors mean Starting point: Unsupervised learning (Cluster via eigenvalues, SVD)

46 Clustering with eigenvalues. Document = bag of words (100k-dimensional vector). Suppose reality = few topics (k): document = 0.3 topic 1 + ... topic 2 + ... Can we automatically find topics? [Figure: matrix with rows Word 1, Word 2, ..., Word m] Rank-k factorization: basic linear algebra, (m x n matrix) M = AB. SVD: best rank-k approximation (space spanned by the singular vectors of the k largest singular values). $M_{m \times n} = A_{m \times k} B_{k \times n}$.
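In symbols, the rank-k factorization obtained from the SVD can be written as below. The notation ($U$, $\Sigma$, $V$, $\sigma_i$, and the truncations $U_k$, $\Sigma_k$, $V_k$ to the k largest singular values) is standard but not from the slide, and the split of $\Sigma_k$ between A and B is one choice among several.

```latex
\[
M = U \Sigma V^{\top},
\qquad
M_k = \underbrace{U_k \Sigma_k}_{A \; (m \times k)} \;
      \underbrace{V_k^{\top}}_{B \; (k \times n)},
\qquad
\| M - M_k \|_F
  = \min_{\mathrm{rank}(N) \le k} \| M - N \|_F
  = \sqrt{\textstyle\sum_{i > k} \sigma_i^2}.
\]
```

The last equality is the sense in which the truncated SVD is the best rank-k approximation: no other matrix of rank at most k is closer to M in Frobenius norm.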

47 Very widely used. Finds hidden topics without even knowing them. Recommendation systems (Netflix): user preferences = combination of genres; movies = combination of genres. In practice: also combine with semi-supervised methods etc.

48 Drawing with eigenvectors Eigenvector v: Av = λv Some eigenvector of the Laplacian: [ , 0.26, 0.44, 0.26] What can you do with this?

49 Drawing with eigenvectors. Eigenvectors of Laplacian. [Figure: drawing whose axes are the 2nd smallest eigenvector and the 3rd smallest eigenvector]
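A small Python sketch of the idea on a graph where the eigenvectors are known in closed form. For the cycle on n vertices, the vectors with entries cos(2πj/n) and sin(2πj/n) are eigenvectors of the Laplacian for its smallest non-zero eigenvalue, 2 - 2cos(2π/n), so they serve as the 2nd and 3rd smallest eigenvectors; using them as x and y coordinates draws the cycle as a regular polygon. The choice of the cycle graph and n = 8 is an assumption for illustration; for a general graph the eigenvectors would be computed numerically.

```python
import math

def cycle_laplacian(n):
    """Laplacian L = D - A of the cycle graph on n >= 3 vertices."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 2.0               # every vertex has degree 2
        L[i][(i + 1) % n] = -1.0    # edge to the next vertex
        L[i][(i - 1) % n] = -1.0    # edge to the previous vertex
    return L

def mat_vec(L, v):
    """Plain matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in L]

n = 8
L = cycle_laplacian(n)
lam = 2 - 2 * math.cos(2 * math.pi / n)  # smallest non-zero eigenvalue
xs = [math.cos(2 * math.pi * j / n) for j in range(n)]
ys = [math.sin(2 * math.pi * j / n) for j in range(n)]

# Check L v = lam v for both coordinate vectors.
for v in (xs, ys):
    assert all(abs(lv - lam * vj) < 1e-9 for lv, vj in zip(mat_vec(L, v), v))

# Vertex j is drawn at (xs[j], ys[j]): the cycle becomes a regular octagon.
for j, (x, y) in enumerate(zip(xs, ys)):
    print(j, round(x, 3), round(y, 3))
```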

50 Drawing with eigenvectors. Eigenvectors of Laplacian

51 Drawing with eigenvectors

52 Drawing with eigenvectors Eigenvectors of Laplacian

53 Drawing with eigenvectors Eigenvectors of Normalized Laplacian

54 Demo on NY times database (Tom Slenders)

55 Why does it work? My motivation was fake: can have negative combinations, document = -1.2 topic 1 + ... topic 2 + ... Right notion: non-negative rank factorization (NP-hard). Perhaps easy in non-worst-case instances? (big open question)

56 Clustering (2): new angle Clustering: Points in space (optimize some objective: k-means, min-sum, k-center, ) Hope: There is some ground truth Optimizing some objective will get you close. Apply factor 100 approximation algorithm. New Model [Balcan, Blum, Gupta]: Let us make this hope explicit Assume: Every <= optimal clustering is ε-close to target. Result(s): Can find 1+O ε close clustering, even though is much smaller than best approximation possible.

57 [Figure: all possible instances, with the instances having the property marked] Algorithm idea: strong properties on what the input looks like (e.g. few outliers). Exploit these. Punchline: these algorithms may actually perform much better than standard k-means, min-sum type algorithms.

58 Correlation Clustering (new model motivated by big data) [Bansal, Blum, Chawla]

59 Clustering. Traditional approach: objects -> points in some high-dim. space. Some distance function. Some objective (k-median, k-means, ...). Hope something useful comes out.

60 Clustering. Document clustering: bag of words (traditional approach). Another approach (Bansal, Blum, Chawla): correlation clustering, i.e. clustering via pairwise similarity. Classifier: takes two documents and tells how similar they are. [Figure: Doc 1 and Doc 2 joined by an edge labeled similar or dissimilar] Idea: use this classifier for clustering.

61 Correlation clustering. E.g. run classifier on every pair of items. [Figure: graph with edges labeled similar or dissimilar]

62 In general, there could be inconsistencies: any clustering disagrees on at least one edge. [Figure: graph with edges labeled similar or dissimilar]

63 In general, there could be inconsistencies: any clustering disagrees on at least one edge. [Figure: graph with edges labeled similar or dissimilar]

64 In general, there could be inconsistencies: any clustering disagrees on at least one edge. [Figure: graph with edges labeled similar or dissimilar] Goal: find a clustering agreeing on most edges. Interesting approximation algorithms. Quite successful. Several extensions (not all pairs, which pair to probe, ...).
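The objective can be made concrete on a tiny inconsistent instance, sketched in Python below. The three-item example (two similar pairs and one dissimilar pair forming a triangle) is hypothetical, and the brute-force search over all labelings is only there to confirm the minimum on such a small input; it is not one of the approximation algorithms mentioned on the slide.

```python
from itertools import product

def disagreements(labels, similar):
    """Count pairs where the clustering contradicts the classifier.

    labels[i] is the cluster of item i; similar[(i, j)] is True if the
    classifier called items i and j similar, False if dissimilar.
    """
    bad = 0
    for (i, j), is_similar in similar.items():
        same_cluster = labels[i] == labels[j]
        if same_cluster != is_similar:
            bad += 1
    return bad

def best_clustering(n, similar):
    """Try every labeling of n items; only sensible for tiny n."""
    return min(product(range(n), repeat=n),
               key=lambda labels: disagreements(labels, similar))

# Hypothetical triangle: 0~1 similar, 1~2 similar, but 0 and 2 dissimilar.
similar = {(0, 1): True, (1, 2): True, (0, 2): False}
best = best_clustering(3, similar)
print(disagreements(best, similar))  # 1
```

No clustering of this triangle reaches zero disagreements: putting all three items together violates the dissimilar pair, and separating any item breaks a similar pair.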

65 Concluding remarks. Very small glimpse (streaming and sketching, statistical learning, machine learning, dealing with noisy data, sub-linear algorithms, ...). Exciting new algorithmic problems: 1) huge impact 2) beautiful ideas 3) interdisciplinary: diverse skills + knowledge.

66 Next Talk

67 Thanks for your attention!