Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)


1 Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven)

2 Algorithm design
Algorithm: a set of steps to solve a problem (by a computer). Studied since the 1950s.
Given a problem: find (i) the best solution, (ii) quickly.
Traveling salesman problem (TSP): n! possible tours (astronomically many already for n = 60).
Ideally: polynomial running time ($n^2$, $n^3$).
33 cities, 1962 competition.
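To make the n! blow-up concrete, here is a minimal brute-force sketch in Python. The function name `brute_force_tsp` and the 4-city distance matrix `DIST` are made up for illustration; they do not come from the talk.

```python
from itertools import permutations
from math import factorial

def brute_force_tsp(dist):
    """Try every tour that starts at city 0 and return (best_length, best_tour)."""
    n = len(dist)
    best_len, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):      # (n-1)! candidate tours
        tour = (0,) + perm
        # Sum the legs of the tour, including the leg back to the start.
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# Hypothetical symmetric distance matrix for 4 cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

if __name__ == "__main__":
    print(brute_force_tsp(DIST))
    print(len(str(factorial(60))), "digits in 60!")
```

Fixing the start city only reduces the work from n! to (n-1)! tours, so this approach stops being usable well before n = 60.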


3 Problems
70s: Polynomial time ($n \log n$, $n^2$, $n^3$), e.g. shortest path, matching, max-flow, ...
NP-hard: TSP and most other problems (brute force, $2^n$, as the only known option).
Late 80s to now: coping with NP-hardness.
Approximation algorithms: even if a problem is NP-hard, maybe a 95%-optimal solution can be found in polynomial time? (Very rich theory and connections.)
Heuristic approaches: often successful in practice.


4 Heuristic methods (TSP) 120 Germany, USA, ,509 USA, ,978 Sweden, ,900 VLSI ,000 USA, still unsolved

5 Tools seemed quite powerful
(Figure labels: Tools, Problems.)

6 Last few years

7 Example
Google: billions of webpages. Figure out which documents are similar (for various reasons, e.g. to show diverse pages for a query).
The polynomial/non-polynomial view is too limited: even $n^2$ time is prohibitive for huge n.
Don't care about a perfect answer: pages change or disappear, and there is no perfect notion of similarity anyway.
Want a quick solution. Some error is alright.


8 Example 2 Facebook: Which 10,000 users (among many millions) should be shown this ad? Want a quick solution. Some error is alright.

9 Very different questions
Needed: totally new ways of thinking. Discard old beliefs.
Beautiful new ideas emerging.

10 Rest of the talk
A glimpse of some ideas:
1) Counting distinct elements
2) Correlation Clustering
3) Local Partitioning
Concluding Remarks

11-15 Counting distinct elements
Input: a stream of numbers (say in the range [1, n]).
Goal: compute the number of distinct elements. In the example the answer is 5, because we saw {3, 4, 2, 17, 11}.
Simple solution: just maintain a list of the items seen thus far.
Across the five slides the list grows as the stream is scanned: { }, { 3 }, { 3, 4 }, { 3, 4 }, { 3, 4, 2 }.
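The simple solution can be sketched in a few lines of Python. The function name and the sample stream are hypothetical; the stream is chosen only so that its distinct elements are the set {3, 4, 2, 17, 11} named on the slide.

```python
def count_distinct_exact(stream):
    """Exact count: remember every item seen so far."""
    seen = set()
    for x in stream:
        seen.add(x)          # adding a repeated item changes nothing
    return len(seen)

# Hypothetical stream whose distinct elements are {3, 4, 2, 17, 11}.
SAMPLE_STREAM = [3, 4, 3, 2, 17, 4, 11, 3]

if __name__ == "__main__":
    print(count_distinct_exact(SAMPLE_STREAM))
```

The memory used grows with the number of distinct items, which is exactly the cost the rest of this part of the talk tries to avoid.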

16 Counting distinct elements
Note: the algorithm tracks all the numbers seen thus far.
Question: what if it can remember only 1 number (very limited memory)?
Trouble: it can barely remember anything about the past. When you scan the next element, you have no clue whether it was already seen.


17 Seems completely hopeless?
The intuition is only partly right. We cannot hope to count exactly, but who cares if the answer is 3,425,269 or 3,425,587?
Approximate counting is possible!
Technique: min-hashing (a beautiful use of approximation and randomization).

18-21 Number of distinct elements
Basic idea [Flajolet-Martin 82]: use a random hash function (map), e.g. an encryption function, $h: [1, n] \to [1, n']$, with say $n' \gg n$.
Algorithm: keep track of $\min_i h(i)$ over the stream.
(Figure: stream elements such as 2 and 8 mapped to points h(2), h(8) on a number line from 1 to n'.)
Note: h(i) is the same every time i is encountered.

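22 Number of distinct elements
Keep track of $\min_i h(i)$.
Suppose there is 1 distinct element (however often it repeats in the stream): then $\min_i h(i)$ is a single random point in $[1, n']$, around $n'/2$ on average.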

23 Number of distinct elements
Keep track of $\min_i h(i)$.
Suppose there are 2 distinct elements: the smaller of the two hash values is around $n'/3$ on average.
If k items are seen, expect the min-value to be around $n'/(k+1)$.
Solution: estimate of # elements = $n' / \min_i h(i) - 1$.
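The $n'/(k+1)$ rule can be checked with a short calculation. As an idealisation that is not spelled out on the slide, assume the k hash values, rescaled to $[0, 1]$, behave like independent uniform draws $U_1, \dots, U_k$:

```latex
\Pr\Big[\min_i U_i > t\Big] = (1 - t)^k,
\qquad
\mathbb{E}\Big[\min_i U_i\Big] = \int_0^1 (1 - t)^k \, dt = \frac{1}{k + 1}.
```

Scaling back to the range $[1, n']$ gives a minimum of about $n'/(k+1)$, and solving for k gives $k \approx n'/\min_i h(i) - 1$.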

24 Number of distinct elements
Randomness could mess things up. E.g. there may be just 1 element, but $\min_i h(i)$ could be far from its expected value $n'/2$.
Standard trick: use O(1) such hash functions and take the median entry.
Theorem: for any $\varepsilon > 0$, one can estimate the number of distinct elements to within a $1 \pm \varepsilon$ factor with high probability. Space = $O(1/\varepsilon^2)$.
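A minimal Python sketch of the min-hash estimate with the median trick follows. Several choices here are assumptions made for illustration: salted BLAKE2b stands in for the random hash functions, the range size is fixed at 2^64, and 31 hash functions are used.

```python
import hashlib
from statistics import median

N_PRIME = 2**64  # size of the hash range [1, n'] (assumed; any n' >> n would do)

def h(x, j):
    """j-th hash function: maps x to a pseudo-random value in [1, n']."""
    salt = j.to_bytes(8, "big")  # a different salt gives a different hash function
    digest = hashlib.blake2b(str(x).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") + 1

def estimate_distinct(stream, num_hashes=31):
    """Median over num_hashes copies of the estimate n' / min h(i) - 1."""
    mins = [N_PRIME] * num_hashes       # one remembered number per hash function
    for x in stream:
        for j in range(num_hashes):
            mins[j] = min(mins[j], h(x, j))
    return median(N_PRIME / m - 1 for m in mins)

if __name__ == "__main__":
    stream = [i % 1000 for i in range(3000)]   # 1000 distinct values, each seen 3 times
    print(estimate_distinct(stream))
```

Repeats in the stream do not change the result, because each hash function returns the same value for the same item. This plain median is only a rough, constant-factor estimate; the $1 \pm \varepsilon$ guarantee of the theorem needs a more careful estimator, which is not shown here.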

25 A closer look
Random hash function h: we need the value h(i) to be the same every time we see i, so one has to store each h(i) somewhere.
Storing h(1), h(2), h(3), ..., h(n) needs n log n space?? Did we just disguise our inherent problem?
There is a fix! Key idea: we do not need the full power of randomness.

26 What is randomness?
We do not need full randomness. Pairwise independence suffices: for any $a_1, a_2$ and any $x_1 \ne x_2$,
$\Pr[h(x_1) = a_1 \text{ and } h(x_2) = a_2] = 1/n'^2$.
Such an h is very simple to store: $h(x) = (ax + b) \bmod p$ [just need to store a and b].
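The pairwise-independence property of $h(x) = (ax + b) \bmod p$ can be verified by brute force for a small prime. The sketch below assumes p is prime and a, b are drawn uniformly from {0, ..., p-1}; the output range is then {0, ..., p-1} rather than [1, n'], and the function names are made up for illustration.

```python
import random
from collections import Counter

def make_hash(p, rng=random):
    """Draw h(x) = (a*x + b) mod p with a, b uniform in {0, ..., p-1}; p must be prime."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return (lambda x: (a * x + b) % p), (a, b)   # only a and b need storing

def is_pairwise_independent(p, x1, x2):
    """Enumerate every (a, b) and check each output pair (h(x1), h(x2)) occurs exactly once."""
    counts = Counter(((a * x1 + b) % p, (a * x2 + b) % p)
                     for a in range(p) for b in range(p))
    return len(counts) == p * p and set(counts.values()) == {1}

if __name__ == "__main__":
    print(is_pairwise_independent(7, 2, 5))   # distinct inputs
    print(is_pairwise_independent(7, 3, 3))   # same input twice: outputs are always equal
```

Each of the p^2 choices of (a, b) yields a different output pair when x1 and x2 differ, so every pair has probability 1/p^2.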

27 Min-Hashing: Applications
Similarity of web pages (pages are similar if they contain mostly similar words).
Google: page -> a few min-hash values (a few bits).
Similar page detection: quadratic -> linear time.
Sketching: complex -> simple.
Tons of amazing applications (several researchers).
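One common way to turn a page into a few min-hash values is sketched below. It assumes a page is treated as a set of words and that similarity means the fraction of shared words (Jaccard similarity); neither detail is stated on the slide, and the function names and the salted BLAKE2b hash are illustrative choices.

```python
import hashlib

def word_hash(word, j):
    """j-th hash function applied to a word (salted BLAKE2b as a stand-in for a random map)."""
    salt = j.to_bytes(8, "big")
    digest = hashlib.blake2b(word.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def signature(words, k=128):
    """Sketch of a page: for each of k hash functions, the minimum hash over its words."""
    return [min(word_hash(w, j) for w in words) for j in range(k)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two sketches agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

if __name__ == "__main__":
    page_a = {f"w{i}" for i in range(100)}        # hypothetical word sets
    page_b = {f"w{i}" for i in range(50, 150)}    # shares 50 of 150 words with page_a
    print(estimated_similarity(signature(page_a), signature(page_b)))
```

For one hash function, the two minima coincide exactly when the overall minimum over both pages falls on a shared word, so the agreement rate estimates the shared fraction. Comparing two pages then costs k comparisons regardless of page length.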

28 Correlation Clustering (new model motivated by big data)

29 Clustering
Cluster documents by topic; cluster users by items bought. One of the most fundamental operations in data mining.
Traditional approach: objects -> points in some high-dimensional space, some distance function, some objective (k-median, k-means, ...). Hope something useful comes out.

30 Clustering
Document clustering: bag of words (traditional approach).
Another approach (Bansal, Blum, Chawla): Correlation Clustering, i.e. clustering via pairwise similarity.
Classifier: takes two documents and tells how similar they are. (Figure: Doc 1 and Doc 2 joined by an edge labelled similar or dissimilar.)
Idea: use this classifier for clustering.

31 Correlation Clustering
E.g. run the classifier on every pair of items. (Figure: graph with each edge labelled similar or dissimilar.)

32-34 In general, there could be inconsistencies
(Figure: graph with edges labelled similar or dissimilar.) Any clustering disagrees on at least one edge.
Goal: find a clustering agreeing on most edges.
Interesting approximation algorithms. Quite successful.
Several extensions (not all pairs, which pair to probe, ...).
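The smallest inconsistent instance makes the objective concrete. The Python sketch below uses a hypothetical triangle in which A-B and B-C are labelled similar but A-C is labelled dissimilar, and finds the best clustering by brute force. The brute force is for illustration only and is not one of the approximation algorithms mentioned on the slide.

```python
from itertools import product

def disagreements(labels, clustering):
    """Count pairs whose label (+1 similar, -1 dissimilar) contradicts the clustering."""
    bad = 0
    for (u, v), sign in labels.items():
        same_cluster = clustering[u] == clustering[v]
        if (sign > 0) != same_cluster:
            bad += 1
    return bad

def best_clustering(items, labels):
    """Try every assignment of cluster ids; feasible only for a handful of items."""
    best = None
    for ids in product(range(len(items)), repeat=len(items)):
        clustering = dict(zip(items, ids))
        cost = disagreements(labels, clustering)
        if best is None or cost < best[0]:
            best = (cost, clustering)
    return best

# Hypothetical inconsistent triangle.
TRIANGLE = {("A", "B"): +1, ("B", "C"): +1, ("A", "C"): -1}

if __name__ == "__main__":
    print(best_clustering(["A", "B", "C"], TRIANGLE))
```

No clustering of the triangle reaches zero disagreements: putting all three together violates A-C, and any split separates a pair that was labelled similar.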

35 Local Partitioning (Light Networks, Philips)

36 Light Networks
Wireless capability: control, monitor, exchange performance data.
Segment controller for a region.

37 Light Networks
Goal: partition the network into pieces such that each piece has (i) good intra-connectivity, (ii) roughly equal size, (iii) small diameter (few hops), (iv) low failure probability.
Impossible to approach via traditional algorithms.
Idea (Bansal, Leeuwaarden, Mathijsen): local partitioning algorithms are easy to tailor.

38 Local Partitioning Algorithms
Studied by Spielman-Teng and Andersen-Peres (inspired by big data).
Find a well-connected piece containing a given node v.
Time proportional to the size of the output piece.
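A much simplified Python sketch of the local idea follows, loosely in the spirit of truncated random-walk methods. It is not the algorithm of either pair of authors. Two things are assumptions rather than statements from the talk: "well-connected" is measured by conductance (cut edges divided by the smaller side's total degree), and the example graph is two 4-cliques joined by one edge.

```python
def conductance(graph, piece):
    """Edges leaving `piece` divided by the smaller of the two sides' total degree."""
    piece = set(piece)
    cut = sum(1 for u in piece for w in graph[u] if w not in piece)
    vol = sum(len(graph[u]) for u in piece)
    total = sum(len(nbrs) for nbrs in graph.values())  # reads the whole graph, for simplicity
    denom = min(vol, total - vol)
    return cut / denom if denom > 0 else 1.0

def local_piece(graph, v, steps=5, eps=1e-3):
    """Spread probability from v by a lazy random walk, dropping tiny entries,
    then return the best-conductance prefix of the degree-normalised ordering."""
    p = {v: 1.0}
    for _ in range(steps):
        nxt = {}
        for u, mass in p.items():
            nxt[u] = nxt.get(u, 0.0) + mass / 2          # stay put with probability 1/2
            share = mass / (2 * len(graph[u]))
            for w in graph[u]:
                nxt[w] = nxt.get(w, 0.0) + share         # otherwise move to a random neighbour
        # Truncation keeps the walk local: only nodes with noticeable mass survive.
        p = {u: m for u, m in nxt.items() if m >= eps * len(graph[u])}
    order = sorted(p, key=lambda u: p[u] / len(graph[u]), reverse=True)
    best, best_phi = [v], conductance(graph, [v])
    for i in range(1, len(order) + 1):
        phi = conductance(graph, order[:i])
        if phi < best_phi:
            best, best_phi = order[:i], phi
    return set(best), best_phi

# Hypothetical graph: two 4-cliques joined by the single edge 3-4 (no isolated nodes).
GRAPH = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}

if __name__ == "__main__":
    print(local_piece(GRAPH, 0))
```

The walk itself touches only nodes near v, which is what makes the approach local. The `conductance` helper here scans the full graph, so this sketch does not achieve the output-proportional running time stated on the slide.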

39-42 Local Algorithms
Details: poster of Britt Mathijsen.

43 Concluding Remarks
A very small glimpse (streaming and sketching, statistical learning, machine learning, dealing with noisy data, sub-linear algorithms, ...).
Exciting new algorithmic problems: 1) huge impact, 2) beautiful ideas, 3) interdisciplinary.
DSCE: a platform to bring diverse groups and skills together.

44 Thanks for your attention!
