Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

Similar documents
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)

Machine Learning Final Project Spam Filtering

Big Data. Lecture 6: Locality Sensitive Hashing (LSH)

Big Data Analytics CSCI 4030

Statistical Machine Learning

Notes on Orthogonal and Symmetric Matrices MENU, Winter 2013

1 if 1 x 0 1 if 0 x 1

Lecture 6: Classification & Localization. boris. ginzburg@intel.com

Simple and efficient online algorithms for real world applications

Solutions to Math 51 First Exam January 29, 2015

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

Numerical Matrix Analysis

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

The Advantages and Disadvantages of Network Computing Nodes

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Large-Scale Similarity and Distance Metric Learning

1 Review of Least Squares Solutions to Overdetermined Systems

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Data Structures Fibonacci Heaps, Amortized Analysis

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

Lecture 2: Homogeneous Coordinates, Lines and Conics

Full and Complete Binary Trees

Name: Section Registered In:

The Power of Asymmetry in Binary Hashing

Introduction to Support Vector Machines. Colin Campbell, Bristol University

4.5 Linear Dependence and Linear Independence

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

Lecture Topic: Low-Rank Approximations

Lecture 2 February 12, 2003

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Notes on Complexity Theory Last updated: August, Lecture 1

Data Warehousing und Data Mining

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Monotonicity Hints. Abstract

3 An Illustrative Example

Proximal mapping via network optimization

Steven C.H. Hoi School of Information Systems Singapore Management University

NOTES ON LINEAR TRANSFORMATIONS

Guessing Game: NP-Complete?

Sublinear Algorithms for Big Data. Part 4: Random Topics

Introduction to Learning & Decision Trees

Polynomial Invariants

Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams

Equilibrium computation: Part 1

Part 2: Community Detection

Efficient Data Structures for Decision Diagrams

ECON 459 Game Theory. Lecture Notes Auctions. Luca Anderlini Spring 2015

Language Modeling. Chapter Introduction

Continued Fractions and the Euclidean Algorithm

Lecture 3: Finding integer solutions to systems of linear equations

3. INNER PRODUCT SPACES

Social Media Mining. Data Mining Essentials

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Server Load Prediction

Fast Matching of Binary Features

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

Algorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)

Chapter 13: Query Processing. Basic Steps in Query Processing

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

THE DIMENSION OF A VECTOR SPACE

MODELING RANDOMNESS IN NETWORK TRAFFIC

Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006

Approximation Algorithms

Linear Programming for Optimization. Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc.

An Overview of Knowledge Discovery Database and Data mining Techniques

Follow the Perturbed Leader

Arrangements And Duality

Lecture #2. Algorithms for Big Data

Big Data - Lecture 1 Optimization reminders

1 Solving LPs: The Simplex Algorithm of George Dantzig

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

Globally Optimal Crowdsourcing Quality Management

Abstract Data Type. EECS 281: Data Structures and Algorithms. The Foundation: Data Structures and Abstract Data Types

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

Similarity and Diagonalization. Similar Matrices

Recommender Systems Seminar Topic : Application Tung Do. 28. Januar 2014 TU Darmstadt Thanh Tung Do 1

CSCI567 Machine Learning (Fall 2014)

LEARNING OBJECTIVES FOR THIS CHAPTER

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

Several Views of Support Vector Machines

MHI3000 Big Data Analytics for Health Care Final Project Report

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Collaborative Filtering. Radek Pelánek

Optimization Modeling for Mining Engineers

Log-Linear Models. Michael Collins

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Machine Learning over Big Data

Statistical Machine Translation: IBM Models 1 and 2

Lecture 1: Course overview, circuits, and formulas

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Machine Learning and Pattern Recognition Logistic Regression

Active Learning SVM for Blogs recommendation

Transcription:

Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)
Anshumali Shrivastava and Ping Li
Cornell University and Rutgers University
NIPS 2014, December 9th 2014

The Maximum Inner Product Search (MIPS) Problem
Given a query $q \in \mathbb{R}^D$ and a giant collection $C$ of $N$ vectors in $\mathbb{R}^D$, search for $p \in C$ such that
$p = \arg\max_{x \in C} q^T x$.
The worst case is $O(N)$ for any query. $N$ is huge, so $O(N)$ is quite expensive for frequent queries. Our goal is to solve this efficiently, in sub-linear time.
This is not the same as the classical near-neighbor search problem:
$\arg\min_{x \in C} \|q - x\|_2^2 = \arg\min_{x \in C} \left( \|x\|_2^2 - 2 q^T x \right) \neq \arg\max_{x \in C} q^T x$.
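As a concrete baseline, here is a minimal NumPy sketch (not from the talk) of the $O(N)$ linear scan the authors want to avoid, contrasted with L2 nearest-neighbor search; the data sizes and random generation are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 8, 1000
C = rng.normal(size=(N, D)) * rng.uniform(0.1, 5.0, size=(N, 1))  # collection with varying norms
q = rng.normal(size=D)

mips_idx = np.argmax(C @ q)                           # arg max_x q^T x   (O(N) scan)
nn_idx = np.argmin(np.linalg.norm(C - q, axis=1))     # arg min_x ||q - x||_2

# Because ||x||_2 varies across C, the two answers generally differ.
print(mips_idx, nn_idx)
```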

Scenario 1: User-Item Recommendations
Matrix factorization for collaborative filtering: $R \approx U \times V$, i.e. $r_{i,j} \approx u_i^T v_j$.
Given a user $u_i$, finding the best item to recommend is a MIPS instance:
$\text{Item} = \arg\max_j r_{i,j} = \arg\max_j u_i^T v_j$.
Vectors $u_i$ and $v_j$ are learned. No control over norms.
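A hypothetical sketch of this retrieval step: given learned user and item factor matrices (the names, sizes, and random factors below are illustrative, not the paper's), recommending an item for user $i$ is exactly a MIPS scan over the item factors.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, d = 100, 5000, 150
U = rng.normal(size=(n_users, d))   # learned user factors
V = rng.normal(size=(n_items, d))   # learned item factors (norms are not controlled)

i = 7
best_item = int(np.argmax(V @ U[i]))  # arg max_j u_i^T v_j, an O(n_items) scan
print(best_item)
```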

Scenario 2: Multi-Class Prediction
A standard multi-class SVM in practice learns a weight vector $w_i$ for each class label $i \in L$. Predicting a new $x_{test}$ is a MIPS instance:
$y_{test} = \arg\max_{i \in L} x_{test}^T w_i$.
The $w_i$'s are learned and usually have varying norms.
Note: fine-grained ImageNet classification has 100,000 classes; in practice this number can be much higher.

Scenario 3: Web-Scale Deep Architectures
The activation of hidden node $i$ is monotonic in $x^T w_i$.
MAXOUT (Goodfellow et al. '13) only requires updating the hidden nodes with maximum activation: $h(x) = \max_j x^T w_j$.
Also: Max-Product Networks (Gens & Domingos '12), Adaptive Dropout (Ba & Frey '13).
Problem: each iteration must find the maximum activation for every data point in every layer. What about networks with millions (or billions) of hidden nodes?
Efficient MIPS = fast training and testing of giant networks.
Note: no control over norms.

Many More!!
Max of affine function approximations (Prof. Nesterov's talk).
Active learning.
Deformable parts models in vision.
Cutting plane methods and greedy coordinate ascent.
Greedy matching pursuit.
...

Outline of the talk
Brief introduction to Locality Sensitive Hashing (LSH).
A negative result: symmetry cannot solve MIPS.
Construction of Asymmetric LSH (ALSH) for MIPS.
Experiments.
Extensions and more connections.

Locality Sensitive Hashing (LSH)
Hashing: a (randomized) function $h$ that maps a given data vector $x \in \mathbb{R}^D$ to an integer key, $h : \mathbb{R}^D \to \{0, 1, 2, \ldots, N\}$.
Locality sensitive: the additional property that
$\Pr_h[\,h(x) = h(y)\,] = f(\text{sim}(x, y))$,
where $f$ is monotonically increasing.
Similar points are more likely to have the same hash value (a hash collision) than dissimilar points.
[Figure: similar points are likely to fall into the same bucket; dissimilar points are unlikely to.]

Hashing for L2 Distance (L2-LSH)
$h_w(x) = \left\lfloor \dfrac{r^T x + b}{w} \right\rfloor$,
where $r \in \mathbb{R}^D$ is drawn from $N(0, I)$, $b$ is drawn uniformly from $[0, w]$, $w$ is a parameter, and $\lfloor \cdot \rfloor$ is the floor operation.
It can be shown that $\Pr(h_w(x) = h_w(y))$ is monotonic in $\|x - y\|_2$.
The random offset $b$ hurts; 1-bit (sign) or 2-bit hashing is preferred (Li et al., ICML '14).
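A minimal sketch of this hash family, assuming $r \sim N(0, I)$ and $b$ uniform on $[0, w]$ exactly as stated above; the helper name make_l2_hash is ours, not the authors'.

```python
import numpy as np

def make_l2_hash(dim, w=2.5, seed=0):
    rng = np.random.default_rng(seed)
    r = rng.normal(size=dim)          # r drawn from N(0, I)
    b = rng.uniform(0.0, w)           # b drawn uniformly from [0, w]
    def h(x):
        return int(np.floor((r @ x + b) / w))
    return h

h = make_l2_hash(dim=8)
x = np.ones(8)
y = x + 0.01
print(h(x), h(y))                     # nearby points are likely to land in the same bin
```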

Sub-linear Near Neighbor Search: Idea
Pick two hash functions $h_1, h_2 : \mathbb{R}^D \to \{0, 1, 2, 3\}$ and index buckets (holding pointers only) by the concatenated codes $0000, 0001, 0010, \ldots, 1111$; some buckets stay empty.
Given a query $q$, if $h_1(q) = 11$ and $h_2(q) = 01$, then probe the bucket with index 1101.
It is a good bucket!! (LSH property): $h_i(q) = h_i(x)$ for all $i$ is an indicator of high similarity between $q$ and $x$.

The Classical LSH Algorithm
[Figure: L hash tables; table $\ell$ is keyed by the concatenation $(h_1^{\ell}, \ldots, h_K^{\ell})$ of K hash values, with buckets 0000 through 1111, some empty.]
Querying: report the union of the L matching buckets.
We can use a K-wise concatenation of hash values to index each table. To improve recall, we repeat the process with L independent tables and take the union at query time.
K and L are two knobs; theory says there is a sweet spot. This gives a provably sub-linear algorithm.
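A sketch of this (K, L) indexing scheme, reusing the illustrative make_l2_hash helper from the previous sketch; the values of K, L, and w below are placeholders, not tuned choices.

```python
from collections import defaultdict

def build_index(points, dim, K=4, L=8, w=2.5):
    tables = []
    for l in range(L):
        hashes = [make_l2_hash(dim, w=w, seed=1000 * l + k) for k in range(K)]
        table = defaultdict(list)
        for idx, x in enumerate(points):
            key = tuple(h(x) for h in hashes)   # K-wise concatenation of hash values
            table[key].append(idx)              # buckets store pointers only
        tables.append((hashes, table))
    return tables

def query_index(tables, q):
    candidates = set()
    for hashes, table in tables:
        key = tuple(h(q) for h in hashes)
        candidates.update(table.get(key, []))   # union over the L probed buckets
    return candidates
```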

Negative Result: LSH Cannot Solve MIPS!!
For inner products, we can have $x$ and $y$ such that $x^T y > x^T x$ (e.g., $x = (1, 0)$ and $y = (2, 0)$): self-similarity is not the highest similarity.
Under any hash function, $\Pr(h(x) = h(x)) = 1$. But locality sensitivity would then require $\Pr(h(x) = h(y)) > \Pr(h(x) = h(x)) = 1$, which is impossible.
We cannot have Locality Sensitive Hashing for inner products!

Extend the Framework: Asymmetric LSH (ALSH)
Main concern: $\Pr(h(x) = h(x)) = 1$, an obvious identity. How about asymmetry?
We can use $P(\cdot)$ for creating buckets, while querying probes buckets using $Q(\cdot)$, with $P \neq Q$. Then $\Pr(Q(x) = P(x)) \neq 1$ in general.
All we need is for $\Pr(Q(q) = P(x))$ to be monotonic in $q^T x$. The same proofs work!! Symmetry in hashing is an unnecessary part of the LSH definition.
Fine! How do I construct $P$ and $Q$? Reduce the problem to a known domain we are comfortable with (the tea-kettle principle in mathematics).

Construction
Known (L2-LSH): $\Pr(h(q) = h(x)) = f\left( \|q\|_2^2 + \|x\|_2^2 - 2 q^T x \right)$.
Idea: construct $P$ and $Q$ such that $\|Q(q)\|_2^2 + \|P(x)\|_2^2 - 2 Q(q)^T P(x)$ is monotonic (or approximately so) in $q^T x$. This is also sufficient!! (We can expand dimensions as part of $P$ and $Q$.)
Pre-processing: scale the data such that $\|x_i\|_2 < 1$ for all $x_i \in C$.
Querying with $q$:
$Q(q) = [q;\ 1/2;\ 1/2;\ \ldots;\ 1/2]$
$P(x_i) = [x_i;\ \|x_i\|_2^2;\ \|x_i\|_2^4;\ \ldots;\ \|x_i\|_2^{2^m}]$
$\|Q(q) - P(x_i)\|_2^2 = \left( \|q\|_2^2 + m/4 \right) - 2 q^T x_i + \|x_i\|_2^{2^{m+1}}$
Since $\|x_i\|_2^{2^{m+1}} \to 0$ and $m$ is constant, we therefore have
$\arg\max_{x \in S} q^T x \approx \arg\min_{x \in S} \|Q(q) - P(x)\|_2$.
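A sketch of these two transformations, with U and m as assumed parameters (the defaults below follow the talk's later recommendation); the query-normalization line is optional and only uses the fact that rescaling q does not change the MIPS arg max.

```python
import numpy as np

def preprocess_data(X, U=0.83, m=3):
    # Scale the whole collection so that every norm is at most U < 1.
    X = X * (U / np.max(np.linalg.norm(X, axis=1)))
    rows = []
    for x in X:
        # P(x) = [x; ||x||^2; ||x||^4; ...; ||x||^(2^m)]
        norms = [np.linalg.norm(x) ** (2 ** (i + 1)) for i in range(m)]
        rows.append(np.concatenate([x, norms]))
    return np.array(rows)

def preprocess_query(q, m=3):
    # Optional: rescaling q does not change arg max_x q^T x.
    q = q / np.linalg.norm(q)
    # Q(q) = [q; 1/2; 1/2; ...; 1/2]
    return np.concatenate([q, np.full(m, 0.5)])
```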

Main Result: First Provable Hashing Algorithm for MIPS
Theorem (Approximate MIPS is Efficient). For the problem of c-approximate MIPS, one can construct a data structure having $O(n^{\rho} \log n)$ query time and $O(n^{1+\rho})$ space, where $\rho < 1$:
$\rho^*_u = \min_{0 < U < 1,\ m \in \mathbb{N},\ r} \dfrac{\log F_r\!\left( \sqrt{m/2 - 2 S_0 \frac{U^2}{M^2} + 2 U^{2^{m+1}}} \right)}{\log F_r\!\left( \sqrt{m/2 - 2 c S_0 \frac{U^2}{M^2}} \right)} \quad \text{s.t.} \quad \dfrac{U^{(2^{m+1} - 2)} M^2}{S_0} < 1 - c.$
The only assumption needed is bounded norms, i.e., $M$ is finite.

Why the parameters do not bother us
Theory: even classical LSH requires computing K and L for a given c-approximate instance, and there is no fixed universal choice. These are not hyper-parameters; they can be exactly computed. In theory, we do not lose any properties.
In practice, there is a good choice: m = 3, U = 0.83, r = 2.5.
[Figure: $\rho$ as a function of the approximation ratio c, for m = 3, U = 0.83, r = 2.5, shown for thresholds from $S_0 = 0.5U$ to $S_0 = 0.9U$.]

The Final Algorithm
Preprocessing: Scale → Append 3 Numbers → Usual L2-LSH.
Scale every $x \in C$ to have norm at most 0.83.
Append $\|x_i\|_2^2$, $\|x_i\|_2^4$, and $\|x_i\|_2^8$ to each vector $x_i$ (just 3 scalars).
Use the standard L2 hash to create hash tables.
Querying: Append 3 Numbers → Usual L2-LSH.
Append 0.5 three times to the query $q$ (just three 0.5s).
Use the standard L2 hash on the transformed query to probe buckets.
A surprisingly simple algorithm, trivial to implement. How much benefit compared to standard hash functions?
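Putting the pieces together, a hedged end-to-end sketch that reuses the illustrative helpers defined in the earlier sketches (make_l2_hash, build_index, query_index, preprocess_data, preprocess_query); the toy data and the values of K, L, and w are placeholders, not tuned settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
C = rng.normal(size=(1000, 150)) * rng.uniform(0.1, 5.0, size=(1000, 1))  # toy collection
q = rng.normal(size=150)                                                   # toy query

P = preprocess_data(C, U=0.83, m=3)                  # scale to norm <= 0.83, append 3 norms
tables = build_index(P, dim=P.shape[1], K=8, L=32)   # standard L2 hash tables over P(x)

Qq = preprocess_query(q, m=3)                        # append three 0.5s to the query
candidates = query_index(tables, Qq)                 # probe buckets with Q(q)
best = max(candidates, key=lambda i: C[i] @ q) if candidates else None  # exact rerank
print(best)
```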

Datasets and Settings
Datasets: Movielens (10M) and Netflix.
Latent user/item features: matrix factorization generates user and item latent features; dimension 150 for Movielens and 300 for Netflix. Given a user as the query, finding the best item is a MIPS instance.
Aim: evaluate and compare computational savings.
Competing hash functions: ALSH (proposed), L2-LSH (LSH for L2 distance), Signed Random Projections (SRP).
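For reference, a minimal sketch of the SRP baseline hash, $h(x) = \text{sign}(r^T x)$ with $r \sim N(0, I)$; the helper name is ours, not the authors'.

```python
import numpy as np

def make_srp_hash(dim, seed=0):
    rng = np.random.default_rng(seed)
    r = rng.normal(size=dim)            # r drawn from N(0, I)
    def h(x):
        return 1 if r @ x >= 0 else 0   # sign(r^T x), written as a 0/1 bit
    return h
```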

Ranking Quality Based on Hash Collisions
[Figure: Precision-recall curves (higher is better) for retrieving the top-10 items, on Movielens and Netflix, for K = 16, 64, and 256, comparing the proposed ALSH against L2LSH and SRP.]

In Action: Savings in Top-k Item Recommendation
The previous evaluations are not true indicators of computational savings, so we run bucketing experiments and make sure the comparison is fair: we find the best $K \in [1, 40]$ and $L \in [1, 400]$ for each recall level. This summarizes 16,000 experiments, averaged over 2,000 queries.
[Figure: fraction of multiplications (mean inner products computed relative to a linear scan, lower is better) versus recall, for top-50 recommendation on Movielens and Netflix, comparing the proposed ALSH against SRP and L2LSH.]

A Generic Recipe for Constructing New ALSHs
In this work: LSH for $S'(q, x) = \|q - x\|_2$ $\Rightarrow$ ALSH for $S(q, x) = q^T x$.
Can we do better? (Yes!) Signed Random Projections (SRP) lead to more informative hashes than L2-LSH (Li et al., ICML 2014).
SRP as the base LSH? Using $S'(q, x) = \frac{q^T x}{\|q\|_2 \|x\|_2}$, we can construct a different set of $P$ and $Q$.
For binary data: minwise hashing and $S'(q, x) = \frac{|q \cap x|}{|q \cup x|}$ (Jaccard).

Asymmetric Minwise Hashing for Set Intersection (Shrivastava & Li 2014)
For sparse data (not necessarily binary), minwise hashing is better than SRP (Shrivastava & Li, AISTATS '14), with $S'(q, x) = \frac{|q \cap x|}{|q \cup x|}$ (Jaccard similarity).
Another $P$ and $Q$ with minwise hashing gives significant improvements:
$P(x) = [x;\ M - f_x;\ 0]$, $\quad Q(x) = [x;\ 0;\ M - f_x]$,
where $f_x$ is the number of non-zeros in $x$, $M$ is the maximum sparsity, and the two padding blocks hold $M - f_x$ dummy non-zeros each.
Exercise in the construction of $P$ and $Q$: expand dimensionality and cancel out the effect of terms we do not want in $S'(q, x)$.
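One reading of this $P$ and $Q$, sketched over sets of indices (binary data): each side is padded with dummy elements from disjoint blocks, so the padding never collides and the transformed Jaccard similarity $a / (2M - a)$ is monotone in the original intersection size $a$; standard minwise hashing would then be applied to the padded sets. This is an illustration of the construction, not the authors' code.

```python
def pad_data(x, M):
    # P(x): pad with (M - f_x) dummy elements from a block reserved for data vectors.
    return x | {("data_pad", i) for i in range(M - len(x))}

def pad_query(q, M):
    # Q(q): pad with (M - f_q) dummy elements from a disjoint block reserved for queries.
    return q | {("query_pad", i) for i in range(M - len(q))}

x, q, M = {1, 2, 3}, {2, 3, 4, 5}, 8
Px, Qq = pad_data(x, M), pad_query(q, M)
a = len(Px & Qq)            # equals |x ∩ q| because the padding blocks never intersect
jaccard = a / len(Px | Qq)  # = a / (2M - a), monotone in the intersection size a
print(a, jaccard)
```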

Conclusions
We provide the first provable and practical hashing solution to MIPS. MIPS occurs as a subroutine in many machine learning applications, and all of those applications will directly benefit from this work.
Constructing asymmetric transformations to reduce MIPS to near-neighbor search was the key realization.
In the task of item recommendation, we obtain significant savings compared to well-known heuristics.
The idea behind the ALSH construction is general, and it connects MIPS to many known similarity search problems.

Acknowledgments
Funding agencies: NSF, ONR, and AFOSR.
Reviewers of KDD 2014 and NIPS 2014.
Prof. Thorsten Joachims and the class of CS 6784 at Cornell University (Spring 2014) for feedback and suggestions.
Thanks for your attention!!