Big Data Lecture 6: Locality Sensitive Hashing (LSH)
Nearest Neighbor Given a set P of n points in R^d
Nearest Neighbor Want to build a data structure to answer nearest neighbor queries
Voronoi Diagram Build a Voronoi diagram & a point location data structure
Curse of dimensionality In R^2 the Voronoi diagram has size O(n) and a query takes O(log n) time. In R^d its complexity is O(n^⌈d/2⌉). Other techniques also scale badly with the dimension.
Locality Sensitive Hashing We will use a family of hash functions such that close points tend to hash to the same bucket. Put all points of P in their buckets; ideally we want the query q to find its nearest neighbor in its bucket.
Locality Sensitive Hashing Def (Charikar): A family H of functions is locality sensitive with respect to a similarity function 0 ≤ sim(p,q) ≤ 1 if Pr[h(p) = h(q)] = sim(p,q)
Example - Hamming Similarity Think of the points as strings of m bits and consider the similarity sim(p,q) = 1 - ham(p,q)/m. H = {h_i(p) = the i-th bit of p} is locality sensitive wrt sim(p,q): Pr[h(p) = h(q)] = 1 - ham(p,q)/m = sim(p,q), equivalently 1 - sim(p,q) = ham(p,q)/m.
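A quick numeric check of this family (a minimal sketch; the two 8-bit points and the trial count are made up for illustration): drawing a bit position i uniformly at random and comparing p[i] with q[i] estimates sim(p,q).

```python
import random

# Illustrative check: sampling a random bit position i and comparing
# p[i] with q[i] estimates sim(p, q) = 1 - ham(p, q)/m.
p = [0, 1, 1, 0, 1, 0, 0, 1]
q = [0, 1, 1, 1, 1, 0, 1, 1]   # ham(p, q) = 2, m = 8, so sim(p, q) = 0.75

random.seed(0)
trials = 100_000
agree = 0
for _ in range(trials):
    i = random.randrange(len(p))   # draw h_i uniformly from H
    agree += p[i] == q[i]
print(agree / trials)   # close to 0.75
```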
Example - Jaccard Think of p and q as sets. sim(p,q) = jaccard(p,q) = |p ∩ q| / |p ∪ q|. H = {h_π(p) = the minimum under π of the items in p}. Pr[h_π(p) = h_π(q)] = jaccard(p,q). Need to pick π from a min-wise independent family of permutations.
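The same kind of sanity check works for MinHash (an illustrative sketch; the two sets and the universe of 100 items are made up): the fraction of random permutations on which the minima of p and q agree estimates jaccard(p,q).

```python
import random

# Illustrative MinHash check over a universe {0, ..., 99}.
p = set(range(0, 60))
q = set(range(30, 90))   # |p & q| = 30, |p | q| = 90, so jaccard = 1/3

random.seed(1)
trials = 20_000
agree = 0
for _ in range(trials):
    rank = random.sample(range(100), 100)   # a random permutation pi
    if min(rank[x] for x in p) == min(rank[x] for x in q):
        agree += 1
print(agree / trials)   # close to 1/3
```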
Map to {0,1} Draw a function b mapping hash values to 0/1 from a pairwise independent family B. So h(p) ≠ h(q) ⇒ Pr[b(h(p)) = b(h(q))] = 1/2. For H' = {b(h(·)) | h ∈ H, b ∈ B}: Pr[b(h(p)) = b(h(q))] = sim(p,q) + (1 - sim(p,q))/2 = (1 + sim(p,q))/2.
Another example ( simhash ) H = {h_r(p) = 1 if r·p > 0, 0 otherwise | r is a random unit vector}. Pr[h_r(p) = h_r(q)] = 1 - θ/π, where θ is the angle between p and q, so this family is locality sensitive wrt sim(p,q) = 1 - θ/π. For binary (e.g. term-document) incidence vectors A and B: θ = cos⁻¹( A·B / (|A| |B|) ).
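The 1 - θ/π collision probability can be checked numerically (a minimal sketch; the two 2-d points at angle π/4 and the trial count are made up): count how often p and q land on the same side of a random hyperplane.

```python
import random

# Illustrative simhash check: collision rate over random directions r
# should approach 1 - theta/pi.
p = [1.0, 0.0]
q = [1.0, 1.0]   # theta = pi/4, so 1 - theta/pi = 0.75

def side(v, r):
    """h_r(v) = 1 if <r, v> > 0 else 0."""
    return 1 if r[0] * v[0] + r[1] * v[1] > 0 else 0

random.seed(2)
trials = 50_000
agree = 0
for _ in range(trials):
    r = [random.gauss(0, 1), random.gauss(0, 1)]   # uniform random direction
    agree += side(p, r) == side(q, r)
print(agree / trials)   # close to 0.75
```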
How do we really use it? Reduce the number of false positives by concatenating hash functions to get new hash functions (a signature ): sig(p) = h_1(p)h_2(p)h_3(p)h_4(p)... = 00101010. Very close documents are hashed to the same bucket or to close buckets (ham(sig(p), sig(q)) is small). See the papers on removing near-duplicates.
A theoretical result on NN
Locality Sensitive Hashing Thm: If there exists a family H of hash functions such that Pr[h(p) = h(q)] = sim(p,q), then d(p,q) = 1 - sim(p,q) satisfies the triangle inequality.
Locality Sensitive Hashing Alternative Def (Indyk-Motwani): A family H of functions is (r_1 < r_2, p_1 > p_2)-sensitive if d(p,q) ≤ r_1 ⇒ Pr[h(p) = h(q)] ≥ p_1 and d(p,q) ≥ r_2 ⇒ Pr[h(p) = h(q)] ≤ p_2. If d(p,q) = 1 - sim(p,q) then this holds with p_1 = 1 - r_1 and p_2 = 1 - r_2. If d(p,q) = ham(p,q) then this holds with p_1 = 1 - r_1/m and p_2 = 1 - r_2/m.
(r,ε)-neighbor problem 1) If there is a neighbor p such that d(p,q) ≤ r, return some p', s.t. d(p',q) ≤ (1+ε)r. 2) If there is no p s.t. d(p,q) ≤ (1+ε)r, return nothing. ((1) is the real requirement, since if we satisfy (1) only, we can satisfy (2) by filtering out answers that are too far.)
(r,ε)-neighbor problem If there is a neighbor p such that d(p,q) ≤ r, we must return some p' s.t. d(p',q) ≤ (1+ε)r, and we never return p such that d(p,q) > (1+ε)r. In particular, we may return p s.t. r ≤ d(p,q) ≤ (1+ε)r.
(r,ε)-neighbor problem Let's construct a data structure that succeeds with constant probability. Focus on the Hamming distance first.
NN using locality sensitive hashing Take a (r_1 < r_2, p_1 > p_2) = (r < (1+ε)r, 1 - r/m > 1 - (1+ε)r/m)-sensitive family. If there is a neighbor at distance ≤ r we catch it with probability ≥ p_1, so to guarantee catching it we need about 1/p_1 functions. But we also get false positives in our 1/p_1 buckets: about n·p_2/p_1 in expectation. Make a new function by concatenating k of these basic functions: we get a (r_1 < r_2, (p_1)^k > (p_2)^k)-sensitive family. If there is a neighbor at distance ≤ r we catch it with probability ≥ (p_1)^k, so we need about 1/(p_1)^k functions, and we expect about n·(p_2)^k/(p_1)^k false positives in our 1/(p_1)^k buckets.
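The construction above can be sketched for the Hamming case (a minimal illustration; the parameters m, k, L and the random data are made up): each of L tables keys the points by a k-bit signature, and a query collects candidates from its bucket in every table.

```python
import random

random.seed(3)
m, k, L = 16, 4, 20   # illustrative parameter choices, not the optimal ones
points = [[random.randint(0, 1) for _ in range(m)] for _ in range(200)]

tables = []
for _ in range(L):
    coords = [random.randrange(m) for _ in range(k)]   # g = (h_i1, ..., h_ik)
    buckets = {}
    for idx, pt in enumerate(points):
        buckets.setdefault(tuple(pt[c] for c in coords), []).append(idx)
    tables.append((coords, buckets))

def candidates(q):
    """Union of the query's buckets over all L tables."""
    out = set()
    for coords, buckets in tables:
        out.update(buckets.get(tuple(q[c] for c in coords), []))
    return out

q = points[0][:]
q[5] ^= 1             # a query at Hamming distance 1 from points[0]
print(0 in candidates(q))
```

The scan-and-filter step of the (r,ε)-neighbor data structure would then compute exact distances to the candidates and keep the closest.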
(r,ε)-neighbor with constant prob. Scan the first 4n·(p_2)^k/(p_1)^k points in the buckets and return the closest. A close neighbor (≤ r_1) is in one of the buckets with probability ≥ 1 - 1/e. There are ≤ 4n·(p_2)^k/(p_1)^k false positives with probability ≥ 3/4 (by Markov's inequality). Both events happen with constant probability.
Analysis Total query time (each operation takes time proportional to the dimension): (1/p_1)^k · (k + n·(p_2)^k). We want to choose k to minimize this: setting n·(p_2)^k ≈ 1, i.e. k = log_{1/p_2}(n) (up to O(log log n)), balances the two terms.
Summary Total query time: (1/p_1)^k · (k + n·(p_2)^k). Put k = log_{1/p_2}(n): the query time becomes n^ρ · log(n), where ρ = log(1/p_1)/log(1/p_2). Total space: n · n^ρ = n^{1+ρ}.
What is ρ? Query time: n^ρ · log(n) with ρ = log(1/p_1)/log(1/p_2); total space n^{1+ρ}. For the Hamming family, p_1 = 1 - r/m and p_2 = 1 - (1+ε)r/m, so ρ = log(1/(1 - r/m)) / log(1/(1 - (1+ε)r/m)) ≤ 1/(1+ε).
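A numeric sanity check of the exponent (the values of m, r, ε here are made up for illustration): ρ indeed comes out below the 1/(1+ε) bound.

```python
import math

# rho = log(1/p1)/log(1/p2) with p1 = 1 - r/m, p2 = 1 - (1+eps)r/m;
# by Bernoulli's inequality (1-x)^(1+eps) >= 1-(1+eps)x, so rho <= 1/(1+eps).
m, r, eps = 1000, 50, 1.0
p1 = 1 - r / m                 # 0.95
p2 = 1 - (1 + eps) * r / m     # 0.90
rho = math.log(1 / p1) / math.log(1 / p2)
print(rho, 1 / (1 + eps))      # rho is about 0.487, bound is 0.5
```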
(1+ε)-approximate NN Given q, find p such that d(q,p) ≤ (1+ε)·d(q,p*), where p* is the exact nearest neighbor. We can use our solution to the (r,ε)-neighbor problem.
(1+ε)-approximate NN vs (r,ε)-neighbor problem If we know r_min and r_max we can find a (1+ε)-approximate NN using log(r_max/r_min) instances of the (r, ε/2)-neighbor problem.
LSH using p-stable distributions Definition: A distribution D is p-stable if, when X_1, ..., X_d are drawn from D, Σ_i v_i X_i is distributed as ‖v‖_p · X where X is drawn from D (the Gaussian distribution is 2-stable). So what do we do with this? Set h(p) = Σ_i p_i X_i; then h(p) - h(q) = Σ_i p_i X_i - Σ_i q_i X_i = Σ_i (p_i - q_i) X_i, which is distributed as ‖p - q‖_p · X. To get discrete buckets, use h(p) = ⌊(Σ_i p_i X_i + b)/r⌋ with b uniform in [0, r], and pick the bucket width r to minimize ρ.
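A sketch of the resulting hash for the Gaussian (2-stable) case, with made-up points and an arbitrary bucket width r: a near pair should collide in ⌊(⟨p,X⟩+b)/r⌋ much more often than a far pair.

```python
import math
import random

random.seed(4)

def make_hash(d, r):
    X = [random.gauss(0, 1) for _ in range(d)]   # 2-stable projection
    b = random.uniform(0, r)                     # random offset
    return lambda v: math.floor((sum(x * vi for x, vi in zip(X, v)) + b) / r)

p      = [0.0, 0.0, 0.0]
q_near = [0.1, 0.0, 0.0]   # distance 0.1 from p
q_far  = [5.0, 0.0, 0.0]   # distance 5.0 from p

trials, r = 20_000, 4.0
near = far = 0
for _ in range(trials):
    h = make_hash(3, r)
    near += h(p) == h(q_near)
    far  += h(p) == h(q_far)
print(near / trials, far / trials)   # the near pair collides far more often
```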
Bibliography M. Charikar: Similarity estimation techniques from rounding algorithms. STOC 2002: 380-388. P. Indyk, R. Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998: 604-613. A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999: 518-529. M. R. Henzinger: Finding near-duplicate web pages: a large-scale evaluation of algorithms. SIGIR 2006: 284-291. G. S. Manku, A. Jain, A. Das Sarma: Detecting near-duplicates for web crawling. WWW 2007: 141-150.