Big Data. Lecture 6: Locality Sensitive Hashing (LSH)

Darcy Clinton Green
8 years ago

1 Big Data Lecture 6: Locality Sensitive Hashing (LSH)

2 Nearest Neighbor Given a set P of n points in $\mathbb{R}^d$

3 Nearest Neighbor Want to build a data structure to answer nearest neighbor queries

4 Voronoi Diagram Build a Voronoi diagram and a point location data structure

5 Curse of dimensionality In $\mathbb{R}^2$ the Voronoi diagram has size $O(n)$ and a query takes $O(\log n)$ time. In $\mathbb{R}^d$ the complexity is $O(n^{\lceil d/2 \rceil})$. Other techniques also scale badly with the dimension.

6 Locality Sensitive Hashing We will use a family of hash functions such that close points tend to hash to the same bucket. Put all points of P in their buckets; ideally we want the query q to find its nearest neighbor in its bucket.

7 Locality Sensitive Hashing Def (Charikar): A family H of functions is locality sensitive with respect to a similarity function $0 \le \mathrm{sim}(p,q) \le 1$ if $\Pr_{h \in H}[h(p) = h(q)] = \mathrm{sim}(p,q)$.

8 Example Hamming Similarity Think of the points as strings of m bits and consider the similarity $\mathrm{sim}(p,q) = 1 - \mathrm{ham}(p,q)/m$. The family $H = \{h_i(p) = \text{the } i\text{-th bit of } p\}$ is locality sensitive with respect to this similarity: $\Pr[h(p) = h(q)] = 1 - \mathrm{ham}(p,q)/m$, and $1 - \mathrm{sim}(p,q) = \mathrm{ham}(p,q)/m$.
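The bit-sampling claim can be checked empirically. The following is a minimal sketch, not part of the original slides; the function names `hamming` and `collision_rate` and the two example strings are made up for illustration. It samples a random bit index many times and compares the observed collision rate with $1 - \mathrm{ham}(p,q)/m$.

```python
import random

def hamming(p: str, q: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(p, q))

def collision_rate(p: str, q: str, trials: int = 20000, seed: int = 0) -> float:
    """Empirical Pr[h_i(p) = h_i(q)] for a uniformly random bit index i."""
    rng = random.Random(seed)
    m = len(p)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(m)      # draw h_i from the family H
        hits += p[i] == q[i]      # h_i(p) == h_i(q) ?
    return hits / trials

# Hypothetical example strings: they differ in 2 of 8 positions.
p = "10110100"
q = "10011100"
exact = 1 - hamming(p, q) / len(p)   # the similarity from the slide
estimate = collision_rate(p, q)      # should be close to `exact`
```

With more trials the estimate concentrates around the exact similarity, which is what the definition on slide 7 requires.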

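9 Example - Jaccard Think of p and q as sets. $\mathrm{sim}(p,q) = \mathrm{jaccard}(p,q) = |p \cap q| / |p \cup q|$. $H = \{h_\pi(p) = \text{the minimum, in the order } \pi, \text{ of the items in } p\}$. $\Pr[h_\pi(p) = h_\pi(q)] = \mathrm{jaccard}(p,q)$. Need to pick $\pi$ from a min-wise independent family of permutations.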
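A small simulation can illustrate the min-hash property. This sketch is an illustration only: it uses fully random permutations (via `shuffle`) as a stand-in for a min-wise independent family, and the names `jaccard`, `minhash_collision_rate`, and the example sets are assumptions.

```python
import random

def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)

def minhash_collision_rate(a, b, universe, trials=5000, seed=0):
    """Empirical Pr[h_pi(a) = h_pi(b)] over random permutations pi."""
    rng = random.Random(seed)
    items = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(items)                        # a uniformly random permutation pi
        rank = {x: i for i, x in enumerate(items)}
        h_a = min(a, key=rank.__getitem__)        # first item of a in the order pi
        h_b = min(b, key=rank.__getitem__)        # first item of b in the order pi
        hits += h_a == h_b
    return hits / trials

# Hypothetical example: two sets sharing 5 of 15 distinct items.
a = set(range(0, 10))
b = set(range(5, 15))
exact = jaccard(a, b)                                  # 5 / 15
estimate = minhash_collision_rate(a, b, range(20))     # should be close to `exact`
```

The two hashes agree exactly when the first element of $p \cup q$ under $\pi$ lies in $p \cap q$, which is why the collision probability equals the Jaccard similarity.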

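10 Map to {0,1} Draw a function b to 0/1 from a pairwise independent family B. So: $h(p) \ne h(q) \Rightarrow \Pr[b(h(p)) = b(h(q))] = 1/2$. $H' = \{b(h(\cdot)) \mid h \in H,\ b \in B\}$. $\Pr[b(h(p)) = b(h(q))] = \mathrm{sim}(p,q) + \frac{1 - \mathrm{sim}(p,q)}{2} = \frac{1 + \mathrm{sim}(p,q)}{2}$.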

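11 Another example ("simhash") $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$

12 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = {?}$

13 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi$, where $\theta$ is the angle between p and q.

14 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi = \mathrm{sim}(p,q)$.

15 Another example $H = \{h_r(p) = 1 \text{ if } r \cdot p > 0,\ 0 \text{ otherwise} \mid r \text{ is a random unit vector}\}$. $\Pr[h_r(p) = h_r(q)] = 1 - \theta/\pi = \mathrm{sim}(p,q)$. For binary vectors (like term-document incidence vectors): $\theta = \cos^{-1}\left(\frac{|A \cap B|}{\sqrt{|A|\,|B|}}\right)$.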
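The random-hyperplane probability can be checked numerically. The sketch below is an assumption-laden illustration: it draws r with independent Gaussian coordinates instead of normalizing to a unit vector (the sign of the dot product does not depend on the length of r), and the helper names and example vectors are hypothetical.

```python
import math
import random

def angle(p, q):
    """Angle between two nonzero vectors, in radians."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def simhash_bit(r, p):
    """h_r(p) = 1 if r . p > 0, else 0."""
    return 1 if sum(a * b for a, b in zip(r, p)) > 0 else 0

def collision_rate(p, q, trials=20000, seed=0):
    """Empirical Pr[h_r(p) = h_r(q)] over random directions r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Gaussian coordinates give a uniformly random direction.
        r = [rng.gauss(0.0, 1.0) for _ in p]
        hits += simhash_bit(r, p) == simhash_bit(r, q)
    return hits / trials

# Hypothetical example vectors at an angle of pi/4.
p = [1.0, 0.0, 0.0]
q = [1.0, 1.0, 0.0]
theta = angle(p, q)
expected = 1 - theta / math.pi     # the formula from the slide
estimate = collision_rate(p, q)    # should be close to `expected`
```

The two bits differ exactly when the random hyperplane separates p from q, which happens with probability $\theta/\pi$.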

16 How do we really use it? Reduce the number of false positives by concatenating hash functions to get new hash functions ("signatures"): $\mathrm{sig}(p) = h_1(p)\,h_2(p)\,h_3(p)\,h_4(p) \ldots = 00101010$. Very close documents are hashed to the same bucket or to close buckets ($\mathrm{ham}(\mathrm{sig}(p), \mathrm{sig}(q))$ is small). See papers on removing almost-duplicates.
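As an illustration of signatures, the sketch below concatenates k bit-sampling hash functions and groups points by signature. It assumes points are bit strings and uses hypothetical names (`make_signature_fn`, `build_buckets`); it is not taken from the papers the slide refers to.

```python
import random
from collections import defaultdict

def make_signature_fn(m: int, k: int, seed: int = 0):
    """Return sig(p) = h_1(p) h_2(p) ... h_k(p), each h_j a random bit of p."""
    rng = random.Random(seed)
    idx = [rng.randrange(m) for _ in range(k)]   # the k sampled bit positions
    def sig(p: str) -> str:
        return "".join(p[i] for i in idx)
    return sig

def build_buckets(points, sig):
    """Group points by signature; each signature value is one bucket."""
    buckets = defaultdict(list)
    for p in points:
        buckets[sig(p)].append(p)
    return buckets

# Hypothetical example: two identical documents and their bitwise complement.
points = ["10110100", "10110100", "01001011"]
sig = make_signature_fn(m=8, k=4)
buckets = build_buckets(points, sig)
```

Identical points always share a bucket, while a point that differs in every bit can never share one, since every sampled position disagrees.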

17 A theoretical result on NN

18 Locality Sensitive Hashing Thm: If there exists a family H of hash functions such that $\Pr[h(p) = h(q)] = \mathrm{sim}(p,q)$, then $d(p,q) = 1 - \mathrm{sim}(p,q)$ satisfies the triangle inequality.

19 Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$ and $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$. If $d(p,q) = 1 - \mathrm{sim}(p,q)$ then this holds with $p_1 = 1 - r_1$ and $p_2 = 1 - r_2$ for every $r_1, r_2$.

20 Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is $(r_1 < r_2,\ p_1 > p_2)$-sensitive if $d(p,q) \le r_1 \Rightarrow \Pr[h(p) = h(q)] \ge p_1$ and $d(p,q) \ge r_2 \Rightarrow \Pr[h(p) = h(q)] \le p_2$. If $d(p,q) = \mathrm{ham}(p,q)$ then this holds with $p_1 = 1 - r_1/m$ and $p_2 = 1 - r_2/m$ for every $r_1, r_2$.

21 (r,ε)-neighbor problem 1) If there is a neighbor p such that $d(p,q) \le r$, return some p' such that $d(p',q) \le (1+\varepsilon)r$. 2) If there is no p such that $d(p,q) \le (1+\varepsilon)r$, return nothing. ((1) is the real requirement: if we satisfy (1) only, we can satisfy (2) by filtering out answers that are too far.)

22 (r,ε)-neighbor problem 1) If there is a neighbor p such that $d(p,q) \le r$, return some p' such that $d(p',q) \le (1+\varepsilon)r$.

23 (r,ε)-neighbor problem 2) Never return p such that $d(p,q) > (1+\varepsilon)r$.

24 (r,ε)-neighbor problem We can return p' such that $r \le d(p',q) \le (1+\varepsilon)r$.

25 (r,ε)-neighbor problem Let's construct a data structure that succeeds with constant probability. Focus on the Hamming distance first.

26 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon)r,\ 1 - r/m > 1 - (1+\varepsilon)r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$.

27 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon)r,\ 1 - r/m > 1 - (1+\varepsilon)r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to guarantee catching it we need $1/p_1$ functions.

28 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon)r,\ 1 - r/m > 1 - (1+\varepsilon)r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to guarantee catching it we need $1/p_1$ functions. But we also get false positives in our $1/p_1$ buckets. How many?

29 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon)r,\ 1 - r/m > 1 - (1+\varepsilon)r/m)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1$, so to guarantee catching it we need $1/p_1$ functions. But we also get false positives in our $1/p_1$ buckets. How many? $n\,p_2/p_1$.

30 NN using locality sensitive hashing Take a $(r_1 < r_2,\ p_1 > p_2) = (r < (1+\varepsilon)r,\ 1 - r/m > 1 - (1+\varepsilon)r/m)$-sensitive family. Make a new function by concatenating k of these basic functions. We get a $(r_1 < r_2,\ p_1^k > p_2^k)$-sensitive family. If there is a neighbor at distance r we catch it with probability $p_1^k$, so to guarantee catching it we need $1/p_1^k$ functions. But we also get false positives in our $1/p_1^k$ buckets. How many? $n\,p_2^k/p_1^k$.

31 (r,ε)-neighbor with constant prob. Scan the first $4n\,p_2^k/p_1^k$ points in the buckets and return the closest. A close neighbor ($\le r_1$) is in one of the buckets with probability $\ge 1 - 1/e$. There are $\le 4n\,p_2^k/p_1^k$ false positives with probability $\ge 3/4$. Both events happen with constant probability.
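The scheme above can be sketched as a small index over bit strings. This is a minimal sketch under stated assumptions, not the lecture's implementation: the class name `HammingLSH`, its parameters, and the random test data are hypothetical, `num_tables` plays the role of $1/p_1^k$, and `max_scan` plays the role of the scan limit $4n\,p_2^k/p_1^k$.

```python
import random
from collections import defaultdict

def hamming(p, q):
    """Number of positions where two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """num_tables hash tables, each keyed by k randomly sampled bits."""

    def __init__(self, points, m, k, num_tables, seed=0):
        rng = random.Random(seed)
        self.points = points
        # One list of k sampled bit positions per table.
        self.index_sets = [[rng.randrange(m) for _ in range(k)]
                           for _ in range(num_tables)]
        self.tables = []
        for idx in self.index_sets:
            table = defaultdict(list)
            for j, p in enumerate(points):
                table[self._key(p, idx)].append(j)
            self.tables.append(table)

    @staticmethod
    def _key(p, idx):
        return tuple(p[i] for i in idx)

    def query(self, q, max_scan):
        """Scan at most max_scan colliding points; return the closest seen."""
        best, best_dist, scanned = None, None, 0
        for idx, table in zip(self.index_sets, self.tables):
            for j in table.get(self._key(q, idx), []):
                if scanned >= max_scan:
                    return best
                scanned += 1
                d = hamming(self.points[j], q)
                if best is None or d < best_dist:
                    best, best_dist = self.points[j], d
        return best

# Hypothetical data: 200 random 32-bit strings.
rng = random.Random(1)
m = 32
points = ["".join(rng.choice("01") for _ in range(m)) for _ in range(200)]
index = HammingLSH(points, m=m, k=8, num_tables=10)

# Query with the first point after flipping its first bit (distance 1).
q = ("1" if points[0][0] == "0" else "0") + points[0][1:]
answer = index.query(q, max_scan=len(points))
```

A caller solving the (r,ε)-neighbor problem would additionally discard an answer whose distance exceeds $(1+\varepsilon)r$, as in requirement (2).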

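32 Analysis Total query time (each operation takes time proportional to the dimension): $\frac{k}{p_1^k} + \frac{n\,p_2^k}{p_1^k}$. We want to choose k to minimize this. Choosing k so that the two terms are equal gives a time of at most twice the minimum over k.

33 Analysis Total query time (each operation takes time proportional to the dimension): $\frac{k}{p_1^k} + \frac{n\,p_2^k}{p_1^k}$. We want to choose k to minimize this: $k = n\,p_2^k \Rightarrow \frac{n}{k} = \left(\frac{1}{p_2}\right)^k \Rightarrow k = \log_{1/p_2}(n) - \Theta(\log\log n)$.

34 Summary Total query time: $\frac{k}{p_1^k} + \frac{n\,p_2^k}{p_1^k}$. Put $k = \log_{1/p_2}(n) - \Theta(\log\log n)$. Query time: $O(n^{\rho} \log_{1/p_2} n)$. Total space: $n \cdot n^{\rho}$.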

35 What is ρ? Query time: $O(n^{\rho} \log_{1/p_2} n)$. Total space: $n \cdot n^{\rho}$, where $\rho = \frac{\log(1/p_1)}{\log(1/p_2)} = \frac{\log\frac{1}{1 - r/m}}{\log\frac{1}{1 - (1+\varepsilon)r/m}}$.
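To get a feel for these quantities, the sketch below computes $p_1$, $p_2$, ρ, k and the number of tables for the Hamming case. It is an illustration with assumptions: the function name `lsh_parameters` and the sample numbers are hypothetical, and it uses the simpler choice $k = \lceil \log_{1/p_2} n \rceil$, ignoring the $\log\log n$ correction. By Bernoulli's inequality, ρ is at most $1/(1+\varepsilon)$ here.

```python
import math

def lsh_parameters(n: int, m: int, r: float, eps: float):
    """Parameters of bit-sampling LSH for the (r, eps)-neighbor problem."""
    p1 = 1 - r / m                        # collision prob. at distance r
    p2 = 1 - (1 + eps) * r / m            # collision prob. at distance (1+eps)r
    rho = math.log(1 / p1) / math.log(1 / p2)
    # Simplified choice of k (assumption): makes n * p2^k <= 1.
    k = math.ceil(math.log(n) / math.log(1 / p2))
    num_tables = math.ceil(p1 ** (-k))    # about 1 / p1^k functions
    return p1, p2, rho, k, num_tables

# Hypothetical instance: a million points with 1000 bits each.
n, eps = 10**6, 1.0
p1, p2, rho, k, num_tables = lsh_parameters(n=n, m=1000, r=50, eps=eps)
```

With this choice of k the expected number of far points per table is at most one, and the number of tables is roughly $n^{\rho}$.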

36 (1+ε)-approximate NN Given q, find p such that $d(q,p) \le (1+\varepsilon)\,d(q,p')$ for every $p' \in P$. We can use our solution to the (r,ε)-neighbor problem.

37 (1+ε)-approximate NN vs (r,ε)-neighbor problem If we know $r_{\min}$ and $r_{\max}$ we can find a (1+ε)-approximate NN using $\log(r_{\max}/r_{\min})$ instances of the $(r,\ \varepsilon' \approx \varepsilon/2)$-neighbor problem.
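One way to realize this reduction is to query a geometric sequence of radii and return the first answer. The sketch below is an assumption-based illustration: it picks $\varepsilon' = \sqrt{1+\varepsilon} - 1 \approx \varepsilon/2$ so that $(1+\varepsilon')^2 \le 1+\varepsilon$, scans the radii linearly, and uses a brute-force stand-in (`brute_force_neighbor`, hypothetical) in place of a real (r,ε')-neighbor structure.

```python
import math

def radii(r_min: float, r_max: float, eps_prime: float):
    """Geometric sequence r_min, r_min(1+eps'), ... covering r_max."""
    out = [r_min]
    while out[-1] < r_max:
        out.append(out[-1] * (1 + eps_prime))
    return out

def approx_nn(q, r_list, neighbor_query):
    """Return the answer of the smallest radius whose query succeeds."""
    for r in r_list:
        p = neighbor_query(q, r)
        if p is not None:
            return p
    return None

def brute_force_neighbor(points, q, r, eps_prime):
    """Stand-in (r, eps')-neighbor oracle on the real line.

    Deliberately returns the farthest allowed answer, to show that the
    reduction still yields a (1+eps)-approximation.
    """
    if min(abs(p - q) for p in points) > r:
        return None
    allowed = [p for p in points if abs(p - q) <= (1 + eps_prime) * r]
    return max(allowed, key=lambda p: abs(p - q))

# Hypothetical 1-D instance: the true nearest neighbor of q is at distance 3.
eps = 1.0
eps_prime = math.sqrt(1 + eps) - 1
points, q = [3.0, 7.0, 20.0], 0.0
r_list = radii(1.0, 32.0, eps_prime)
result = approx_nn(q, r_list, lambda x, r: brute_force_neighbor(points, x, r, eps_prime))
```

If the query at radius r fails, no point lies within r, so an answer found at radius $(1+\varepsilon')r$ is within $(1+\varepsilon')^2 \le 1+\varepsilon$ of the optimum. A binary search over the radii would use fewer queries.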

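38 LSH using p-stable distributions Definition: A distribution D is 2-stable if, when $X_1, \ldots, X_d$ are drawn from D, $\sum_i v_i X_i$ has the same distribution as $\|v\|_2\, X$, where X is drawn from D. So what do we do with this? $h(p) = \sum_i p_i X_i$, so $h(p) - h(q) = \sum_i p_i X_i - \sum_i q_i X_i = \sum_i (p_i - q_i) X_i$, which is distributed as $\|p - q\|_2\, X$.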

39 LSH using p-stable distributions Definition: A distribution D is 2-stable if, when $X_1, \ldots, X_d$ are drawn from D, $\sum_i v_i X_i$ has the same distribution as $\|v\|_2\, X$, where X is drawn from D. So what do we do with this? $h(p) = \lfloor (p \cdot X + b)/r \rfloor$. Pick r to minimize ρ.
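A hash function of this form can be sketched with Gaussian coordinates, since the standard normal distribution is 2-stable. The names `make_hash` and `projection_samples` are hypothetical, and drawing the offset b uniformly from [0, r) is an assumption that follows the usual construction rather than something stated on the slide.

```python
import math
import random

def make_hash(d: int, r: float, seed: int = 0):
    """h(p) = floor((p . X + b) / r) with Gaussian X and offset b in [0, r)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]   # 2-stable coordinates
    b = rng.uniform(0.0, r)                       # assumed random offset
    def h(p):
        dot = sum(pi * xi for pi, xi in zip(p, x))
        return math.floor((dot + b) / r)
    return h

def projection_samples(v, trials=20000, seed=0):
    """Samples of sum_i v_i X_i for fresh Gaussian X each time."""
    rng = random.Random(seed)
    return [sum(vi * rng.gauss(0.0, 1.0) for vi in v) for _ in range(trials)]

# 2-stability check on a hypothetical vector with norm 5:
# the projection should behave like 5 * X, so its variance is about 25.
v = [3.0, 4.0]
samples = projection_samples(v)
variance = sum(s * s for s in samples) / len(samples)

h = make_hash(d=2, r=4.0)
bucket_of_v = h(v)
```

Because $h(p) - h(q)$ before rounding is distributed as $\|p - q\|_2\, X$, nearby points fall into the same width-r interval more often than distant ones.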

40 Bibliography
M. Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002.
P. Indyk, R. Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999.
M. R. Henzinger: Finding near-duplicate web pages: a large-scale evaluation of algorithms. SIGIR 2006.
G. S. Manku, A. Jain, A. Das Sarma: Detecting near-duplicates for web crawling. WWW 2007.
