Big Data. Lecture 6: Locality Sensitive Hashing (LSH)



Similar documents
Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

SimHash-based Effective and Efficient Detecting of Near-Duplicate Short Messages

Challenges in Finding an Appropriate Multi-Dimensional Index Structure with Respect to Specific Use Cases

Similarity Search in a Very Large Scale Using Hadoop and HBase

Clustering using Simhash and Locality Sensitive Hashing in Hadoop HDFS : An Infrastructure Extension

Point Location. Preprocess a planar, polygonal subdivision for point location queries. p = (18, 11)

Fast Matching of Binary Features

Big Data Analytics CSCI 4030

Large-Scale Distributed Locality-Sensitive Hashing for General Metric Data

Clustering and Load Balancing Optimization for Redundant Content Removal

Efficient Approximate Similarity Search Using Random Projection Learning

Manual for BEAR Big Data Ensemble of Adaptations for Regression Version 1.0

Learning Binary Hash Codes for Large-Scale Image Search

Spam Detection Using Customized SimHash Function

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Monitoring Frequency of Change By Li Qin

C-Bus Voltage Calculation

Lecture 4 Online and streaming algorithms for clustering

Counting Problems in Flash Storage Design

Jubatus: An Open Source Platform for Distributed Online Machine Learning

New Hash Function Construction for Textual and Geometric Data Retrieval

Machine Learning Final Project Spam Filtering

1 Gambler s Ruin Problem

Big Data & Scripting Part II Streaming Algorithms

Predictive Indexing for Fast Search

Mean Shift Based Clustering in High Dimensions: A Texture Classification Example

Data Warehousing und Data Mining

CSC574 - Computer and Network Security Module: Intrusion Detection

Lecture 6 Online and streaming algorithms for clustering

Introduction to NP-Completeness Written and copyright c by Jie Wang 1

In order to describe motion you need to describe the following properties.

Geometry and Topology from Point Cloud Data

Lecture #2. Algorithms for Big Data

The Advantages and Disadvantages of Network Computing Nodes

The Online Freeze-tag Problem

Topological Data Analysis Applications to Computer Vision

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

CIS 700: algorithms for Big Data

United Arab Emirates University College of Sciences Department of Mathematical Sciences HOMEWORK 1 SOLUTION. Section 10.1 Vectors in the Plane

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Streaming Algorithms

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

BALTIC OLYMPIAD IN INFORMATICS Stockholm, April 18-22, 2009 Page 1 of?? ENG rectangle. Rectangle

Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs

MODELING RANDOMNESS IN NETWORK TRAFFIC

CS Computer and Network Security: Intrusion Detection

Efficient Similarity Joins for Near Duplicate Detection

5.4 Closest Pair of Points

Load Balancing between Computing Clusters

Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman

Efficient Similarity Search over Encrypted Data

Sublinear Algorithms for Big Data. Part 4: Random Topics

Distributed Computing over Communication Networks: Maximal Independent Set

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Finding Similar Items

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

(67902) Topics in Theory and Complexity Nov 2, Lecture 7

Stat 134 Fall 2011: Gambler s ruin

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Caching Dynamic Skyline Queries

Big Data Begets Big Database Theory

Part II: Bidding, Dynamics and Competition. Jon Feldman S. Muthukrishnan

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

SECTION 6: FIBER BUNDLES

Product quantization for nearest neighbor search

Chapter 6: Episode discovery process

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Can linear programs solve NP-hard problems?

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Text Clustering Using LucidWorks and Apache Mahout

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

To determine vertical angular frequency, we need to express vertical viewing angle in terms of and. 2tan. (degree). (1 pt)

PulsON RangeNet / ALOHA Guide to Optimal Performance. Brandon Dewberry, CTO

Storage Basics Architecting the Storage Supplemental Handout

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

MACHINE LEARNING IN HIGH ENERGY PHYSICS

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Transcription:

Big Data Lecture 6: Locality Sensitive Hashing (LSH)

Nearest Neighbor: Given a set P of n points in R^d

Nearest Neighbor Want to build a data structure to answer nearest neighbor queries

Voronoi Diagram: Build a Voronoi diagram & a point location data structure

Curse of dimensionality: In R^2 the Voronoi diagram is of size O(n) and a query takes O(log n) time. In R^d the complexity is O(n^⌈d/2⌉). Other techniques also scale badly with the dimension

Locality Sensitive Hashing: We will use a family of hash functions such that close points tend to hash to the same bucket. Put all points of P in their buckets; ideally we want the query q to find its nearest neighbor in its bucket

Locality Sensitive Hashing: Def (Charikar): A family H of functions is locality sensitive with respect to a similarity function 0 ≤ sim(p,q) ≤ 1 if Pr[h(p) = h(q)] = sim(p,q)

Example: Hamming Similarity. Think of the points as strings of m bits and consider the similarity sim(p,q) = 1 − ham(p,q)/m. H = {h_i(p) = the i-th bit of p} is locality sensitive wrt sim(p,q) = 1 − ham(p,q)/m: Pr[h(p) = h(q)] = 1 − ham(p,q)/m, and 1 − sim(p,q) = ham(p,q)/m
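To make the bit-sampling family concrete, here is a minimal sketch (the strings p and q are made-up examples): a hash drawn uniformly from H collides on p and q for exactly the fraction of positions where they agree, i.e. with probability 1 − ham(p,q)/m.

```python
m = 16
p = "1010101010101010"
q = "1010101010101111"  # ham(p, q) = 2, so sim(p, q) = 1 - 2/16 = 0.875

# The whole family H: h_i(p) returns the i-th bit of p.
H = [lambda s, i=i: s[i] for i in range(m)]

# Fraction of functions in H on which p and q collide.
collisions = sum(1 for h in H if h(p) == h(q))
print(collisions / m)  # 0.875 = sim(p, q)
```

Averaging over the whole family rather than sampling makes the collision probability exact here.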

Example: Jaccard. Think of p and q as sets; sim(p,q) = jaccard(p,q) = |p ∩ q| / |p ∪ q|. H = {h_π(p) = the minimum under the permutation π of the items in p}. Pr[h_π(p) = h_π(q)] = jaccard(p,q). Need to pick π from a min-wise independent family of permutations
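A small MinHash sketch of this, using fully random permutations (which are trivially min-wise independent) instead of a compact min-wise independent family; the sets p and q are made-up examples:

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash(s, perm):
    # h_pi(S): the minimum item of S under the random permutation pi
    return min(perm[x] for x in s)

random.seed(42)
universe = list(range(100))
p = set(range(0, 60))
q = set(range(30, 90))   # |p ∩ q| = 30, |p ∪ q| = 90, jaccard = 1/3
trials = 5000
hits = 0
for _ in range(trials):
    order = random.sample(universe, len(universe))      # a uniform random permutation
    perm = {x: rank for rank, x in enumerate(order)}
    hits += minhash(p, perm) == minhash(q, perm)
print(jaccard(p, q), hits / trials)  # both ≈ 1/3
```

The empirical collision rate of the min-hashes estimates jaccard(p,q), exactly as the slide states.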

Map to {0,1}: Draw a function b to 0/1 from a pairwise independent family B. So: h(p) ≠ h(q) ⇒ Pr[b(h(p)) = b(h(q))] = 1/2. H′ = {b(h(·)) : h ∈ H, b ∈ B}. Pr[b(h(p)) = b(h(q))] = sim(p,q) + (1 − sim(p,q))/2 = 1/2 + sim(p,q)/2

Another example ("simhash"): H = {h_r(p) = 1 if r·p > 0, 0 otherwise : r is a random unit vector}

Another example: H = {h_r(p) = 1 if r·p > 0, 0 otherwise : r is a random unit vector}. Pr[h_r(p) = h_r(q)] = ?

Another example: H = {h_r(p) = 1 if r·p > 0, 0 otherwise : r is a random unit vector}. Pr[h_r(p) = h_r(q)] = 1 − θ/π

Another example: H = {h_r(p) = 1 if r·p > 0, 0 otherwise : r is a random unit vector}. Pr[h_r(p) = h_r(q)] = 1 − θ/π = sim(p,q)

Another example: H = {h_r(p) = 1 if r·p > 0, 0 otherwise : r is a random unit vector}. Pr[h_r(p) = h_r(q)] = 1 − θ/π = sim(p,q). For binary vectors (like term-doc incidence vectors): θ = cos⁻¹(A·B / (|A||B|))
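The random-hyperplane family can be checked empirically; a minimal sketch in 2D, where the vectors p and q are made-up examples at angle θ = π/4 (a spherical Gaussian gives a uniformly random hyperplane normal):

```python
import math, random

def random_hyperplane_hash(d, rng):
    # r ~ spherical Gaussian: its direction is a uniform random unit vector
    r = [rng.gauss(0, 1) for _ in range(d)]
    return lambda v: 1 if sum(ri * vi for ri, vi in zip(r, v)) > 0 else 0

def angle(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(dot / (math.hypot(*u) * math.hypot(*v)))

rng = random.Random(1)
p, q = (1.0, 0.0), (1.0, 1.0)   # theta = pi/4, so 1 - theta/pi = 0.75
trials = 20000
hits = 0
for _ in range(trials):
    h = random_hyperplane_hash(2, rng)
    hits += h(p) == h(q)
print(1 - angle(p, q) / math.pi, hits / trials)  # 0.75 vs ≈ 0.75
```

The empirical collision rate matches 1 − θ/π: the two points land on the same side of the random hyperplane unless it falls inside the angle between them.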

How do we really use it? Reduce the number of false positives by concatenating hash functions to get new hash functions ("signatures"): sig(p) = h₁(p)h₂(p)h₃(p)h₄(p)··· = 00101010. Very close documents are hashed to the same bucket or to close buckets (ham(sig(p), sig(q)) is small). See papers on removing almost-duplicates
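A sketch of the signature idea, reusing bit sampling from the Hamming example as the basic family (the names make_signature and bit_family are mine):

```python
import random

def make_signature(basic_family, k, rng):
    # sig(p) = h1(p) h2(p) ... hk(p): concatenate k basic hash bits
    hs = [basic_family(rng) for _ in range(k)]
    return lambda p: "".join(str(h(p)) for h in hs)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

m = 32
# Basic family: h_i(p) = the i-th bit of p, for a random i.
bit_family = lambda rng: (lambda p, i=rng.randrange(m): int(p[i]))

rng = random.Random(0)
sig = make_signature(bit_family, 8, rng)
p = "1" * 32
q = "1" * 31 + "0"   # very close to p: signatures agree on almost all sampled bits
print(sig(p), sig(q), hamming(sig(p), sig(q)))
```

Close inputs produce signatures at small Hamming distance, so they land in the same or nearby buckets.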

A theoretical result on NN

Locality Sensitive Hashing: Thm: If there exists a family H of hash functions such that Pr[h(p) = h(q)] = sim(p,q), then d(p,q) = 1 − sim(p,q) satisfies the triangle inequality

Locality Sensitive Hashing: Alternative Def (Indyk-Motwani): A family H of functions is (r₁ < r₂, p₁ > p₂)-sensitive if d(p,q) ≤ r₁ ⇒ Pr[h(p) = h(q)] ≥ p₁, and d(p,q) ≥ r₂ ⇒ Pr[h(p) = h(q)] ≤ p₂. If d(p,q) = 1 − sim(p,q) then this holds with p₁ = 1 − r₁ and p₂ = 1 − r₂

Locality Sensitive Hashing: Alternative Def (Indyk-Motwani): A family H of functions is (r₁ < r₂, p₁ > p₂)-sensitive if d(p,q) ≤ r₁ ⇒ Pr[h(p) = h(q)] ≥ p₁, and d(p,q) ≥ r₂ ⇒ Pr[h(p) = h(q)] ≤ p₂. If d(p,q) = ham(p,q) then this holds with p₁ = 1 − r₁/m and p₂ = 1 − r₂/m

(r,ε)-neighbor problem: 1) If there is a neighbor p such that d(p,q) ≤ r, return p′ s.t. d(p′,q) ≤ (1+ε)r. 2) If there is no p s.t. d(p,q) ≤ (1+ε)r, return nothing. ((1) is the real requirement, since if we satisfy (1) only, we can satisfy (2) by filtering answers that are too far)

(r,ε)-neighbor problem: 1) If there is a neighbor p such that d(p,q) ≤ r, return p′ s.t. d(p′,q) ≤ (1+ε)r. [figure: balls of radius r and (1+ε)r around q]

(r,ε)-neighbor problem: 2) Never return p such that d(p,q) > (1+ε)r

(r,ε)-neighbor problem: We can return p s.t. r ≤ d(p,q) ≤ (1+ε)r

(r,ε)-neighbor problem: Let's construct a data structure that succeeds with constant probability. Focus on the Hamming distance first

NN using locality sensitive hashing: Take a (r₁ < r₂, p₁ > p₂) = (r < (1+ε)r, 1 − r/m > 1 − (1+ε)r/m)-sensitive family. If there is a neighbor at distance ≤ r, we catch it with probability ≥ p₁

NN using locality sensitive hashing: Take a (r₁ < r₂, p₁ > p₂) = (r < (1+ε)r, 1 − r/m > 1 − (1+ε)r/m)-sensitive family. If there is a neighbor at distance ≤ r, we catch it with probability ≥ p₁, so to guarantee catching it we need ≈ 1/p₁ functions..

NN using locality sensitive hashing: Take a (r₁ < r₂, p₁ > p₂) = (r < (1+ε)r, 1 − r/m > 1 − (1+ε)r/m)-sensitive family. If there is a neighbor at distance ≤ r, we catch it with probability ≥ p₁, so to guarantee catching it we need ≈ 1/p₁ functions.. But we also get false positives in our 1/p₁ buckets — how many?

NN using locality sensitive hashing: Take a (r₁ < r₂, p₁ > p₂) = (r < (1+ε)r, 1 − r/m > 1 − (1+ε)r/m)-sensitive family. If there is a neighbor at distance ≤ r, we catch it with probability ≥ p₁, so to guarantee catching it we need ≈ 1/p₁ functions.. But we also get false positives in our 1/p₁ buckets — how many? ≈ n·p₂/p₁

NN using locality sensitive hashing: Take a (r₁ < r₂, p₁ > p₂) = (r < (1+ε)r, 1 − r/m > 1 − (1+ε)r/m)-sensitive family. Make a new function by concatenating k of these basic functions. We get a (r₁ < r₂, (p₁)^k > (p₂)^k)-sensitive family. If there is a neighbor at distance ≤ r, we catch it with probability ≥ (p₁)^k, so to guarantee catching it we need ≈ 1/(p₁)^k functions.. But we also get false positives in our 1/(p₁)^k buckets — how many? ≈ n(p₂)^k/(p₁)^k
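The whole construction — k concatenated bit-sampling hashes per table, several tables to boost the catch probability — can be sketched as a small index for m-bit strings (the class name HammingLSH and the parameter choices are mine, for illustration):

```python
import random

class HammingLSH:
    """Sketch: L tables; table t keys each point by k sampled bit positions."""
    def __init__(self, m, k, L, seed=0):
        rng = random.Random(seed)
        self.keys = [[rng.randrange(m) for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]

    def _sig(self, t, p):
        # Concatenation of k basic bit-sampling hashes for table t.
        return "".join(p[i] for i in self.keys[t])

    def insert(self, p):
        for t, table in enumerate(self.tables):
            table.setdefault(self._sig(t, p), []).append(p)

    def query(self, q):
        # Collect everything that shares a bucket with q in some table.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._sig(t, q), []))
        return candidates

rng = random.Random(7)
points = ["".join(rng.choice("01") for _ in range(64)) for _ in range(200)]
index = HammingLSH(m=64, k=8, L=20)
for p in points:
    index.insert(p)

q = points[0][:62] + "11"      # at Hamming distance <= 2 from points[0]
cands = index.query(q)
print(points[0] in cands, len(cands))
```

Each table misses the near neighbor with probability about 1 − (p₁)^k, so with enough tables it is caught almost surely, while random far points rarely share an 8-bit key.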

(r,ε)-neighbor with constant prob: Scan the first 4n(p₂)^k/(p₁)^k points in the buckets and return the closest. A close neighbor (within r₁) is in one of the buckets with probability ≥ 1 − 1/e. There are ≤ 4n(p₂)^k/(p₁)^k false positives with probability ≥ 3/4. Both events happen with constant probability

Analysis: Total query time (each operation takes time proportional to the dimension): (1/(p₁)^k)·(k + n(p₂)^k). We want to choose k to minimize this. [figure: query time as a function of k, minimized where the two terms balance]

Analysis: Total query time (each operation takes time proportional to the dimension): (1/(p₁)^k)·(k + n(p₂)^k). We want to choose k to minimize this: balancing the two terms, n(p₂)^k ≈ k, gives k = log_{1/p₂}(n) (up to a Θ(log log n) term)

Summary: Total query time: (1/(p₁)^k)·(k + n(p₂)^k). Put k = log_{1/p₂}(n); we get query time ≈ log_{1/p₂}(n) · n^(log(1/p₁)/log(1/p₂)). Total space: n · n^(log(1/p₁)/log(1/p₂))

What is? Query time: log ( n) 1 1 2 1 n log log 1 1 1 2 n Total sace: n n 1 log r log 1 1 log 1 m 1 1 log 2 (1 ) r 1 log log 1 m 2

(1+ε)-approximate NN: Given q, find p such that d(q,p) ≤ (1+ε)·d(q,p*), where p* is the true nearest neighbor. We can use our solution to the (r,ε)-neighbor problem

(1+ε)-approximate NN vs (r,ε)-neighbor problem: If we know r_min and r_max, we can find a (1+ε)-approximate NN using log(r_max/r_min) instances of the (r, ε/2)-neighbor problem

LSH using p-stable distributions: Definition: A distribution D is 2-stable if when X₁,…,X_d are drawn from D, ∑ᵢ vᵢXᵢ is distributed as ‖v‖·X where X is drawn from D. So what do we do with this? h(p) = ∑ᵢ pᵢXᵢ. Then h(p) − h(q) = ∑ᵢ pᵢXᵢ − ∑ᵢ qᵢXᵢ = ∑ᵢ (pᵢ − qᵢ)Xᵢ = ‖p − q‖·X

LSH using p-stable distributions: Definition: A distribution D is 2-stable if when X₁,…,X_d are drawn from D, ∑ᵢ vᵢXᵢ is distributed as ‖v‖·X where X is drawn from D. So what do we do with this? h(p) = ⌊(p·X + b)/r⌋, with b a random offset in [0, r). Pick r to minimize ρ
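A sketch of this 2-stable (Gaussian) hash, h(p) = ⌊(p·X + b)/r⌋; the example vectors, the bucket width w = 4, and the names pstable_hash, close_rate, far_rate are my choices for illustration:

```python
import math, random

def pstable_hash(d, w, rng):
    # X has i.i.d. N(0,1) coordinates (2-stable); b is a random offset in [0, w)
    X = [rng.gauss(0, 1) for _ in range(d)]
    b = rng.uniform(0, w)
    return lambda v: math.floor((sum(x * vi for x, vi in zip(X, v)) + b) / w)

rng = random.Random(3)
p   = [1.0, 2.0, 3.0]
q   = [1.1, 2.0, 3.1]      # ||p - q|| ~ 0.14: close
far = [9.0, -4.0, 0.0]     # ||p - far|| ~ 10.4: far
trials = 5000
hs = [pstable_hash(3, w=4.0, rng=rng) for _ in range(trials)]
close_rate = sum(h(p) == h(q) for h in hs) / trials
far_rate   = sum(h(p) == h(far) for h in hs) / trials
print(close_rate, far_rate)  # close pairs collide far more often
```

Because h(p) − h(q) behaves like ‖p − q‖·X before bucketing, nearby points fall into the same width-w bucket much more often than distant ones, which is exactly the locality-sensitivity needed.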

Bibliography:
M. Charikar: Similarity Estimation Techniques from Rounding Algorithms. STOC 2002: 380-388.
P. Indyk, R. Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998: 604-613.
A. Gionis, P. Indyk, R. Motwani: Similarity Search in High Dimensions via Hashing. VLDB 1999: 518-529.
M. R. Henzinger: Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. SIGIR 2006: 284-291.
G. S. Manku, A. Jain, A. Das Sarma: Detecting Near-Duplicates for Web Crawling. WWW 2007: 141-150.