Similarity Search in a Very Large Scale Using Hadoop and HBase

Transcription

1 Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France Conservatoire National des Arts et Metiers, Paris, France Faculty of Informatics, Masaryk University, Brno, Czech Republic November 3, 2010 Stanislav Barton et al. (Dauphine) Similarity Search in a Very Large Scale November 3, / 22

2 Outline. Motivation; Disk-based Access Structure Outline; Going for Really Large Scale via Map-Reduce Framework; Results; Conclusions.

3 Motivation. Large collections of multimedia data exist: Flickr, Google Images, etc. Search is usually implemented on textual metadata in the form of annotations or tags. Search by content: a paradigm that exploits an automatic description of an object derived from its content. Extensibility: the object representation and distance function are not limited to $L_p$ metrics on Euclidean vector spaces. Content-based search is a challenging problem on large (web-scale) collections.

4 Motivation. Many access structures work well on vector data but are limited by main memory (LSH, MChord, Metric Inverted File, etc.). With the transition to cheaper and larger disk storage, their performance degrades significantly. Scalability is addressed by distributing the access structure over a cluster of computers (a single computer with the same amount of RAM is much more expensive). The proposed access structure trades approximation and slower query processing for high scalability by moving to cheap secondary memory.

5 Extensibility via Metric Spaces. A metric space is defined as a pair $(D, d)$, where $D$ denotes the domain of objects and $d : D \times D \to \mathbb{R}$ is a total function satisfying: non-negativity, $\forall x, y \in D:\ d(x, y) \ge 0$; symmetry, $\forall x, y \in D:\ d(x, y) = d(y, x)$; identity, $\forall x, y \in D:\ x = y \Leftrightarrow d(x, y) = 0$; triangle inequality, $\forall x, y, z \in D:\ d(x, z) \le d(x, y) + d(y, z)$.
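The four axioms can be checked empirically on a sample of objects. The following is a minimal sketch, not part of the original slides: the helper names (`euclidean`, `violates_metric_axioms`) and the sample points are illustrative assumptions. It shows that the Euclidean distance passes on the sample while the squared Euclidean distance breaks the triangle inequality.

```python
import itertools
import math


def euclidean(x, y):
    """L2 distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def violates_metric_axioms(points, d, tol=1e-9):
    """Return the name of the first axiom violated on `points`, or None.

    This only tests the given sample; it cannot prove that d is a metric.
    """
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:
            return "non-negativity"
        if abs(d(x, y) - d(y, x)) > tol:
            return "symmetry"
        if (x == y) != (d(x, y) <= tol):
            return "identity"
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return "triangle inequality"
    return None


# Hypothetical sample points, chosen only for illustration.
sample = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]


def squared_euclidean(x, y):
    """Squared L2 distance: symmetric and non-negative, but not a metric."""
    return euclidean(x, y) ** 2


euclidean_check = violates_metric_axioms(sample, euclidean)
squared_check = violates_metric_axioms(sample, squared_euclidean)
```

Any distance function that passes such checks for all objects of its domain can be plugged into the access structure, which is the extensibility argument of this slide.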

6 Access Structure Outline. Very sparse distance matrix. Locality phenomenon (filtering): a large number of random pivots ($P$) is selected from the collection, and only the small number of pivots closest to each indexed object is exploited. Compute $|P| \cdot n$ distances, where $n$ is the size of the indexed collection and $|P| \ll n$. Store only the $m$ closest distances for each object, $m \ll |P|$. Final storage complexity: $m \cdot n$.

7 Access Structure Outline. Index creation: reference objects are randomly selected from the collection ($|P|$ in the thousands); for each object in the collection, the distances to all reference objects are computed (M-Tree and kNN query); a predefined number $m$ of closest distances is maintained for each object ($m < 10$). Search: (1) find the $s$ nearest neighbors of the query object among the reference objects; (2) find the data objects that have these $s$ reference objects among their $m$ closest reference objects (filtering); (3) rank these candidates by estimating the original distance (based on the distances to the $s$ reference objects).
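The index creation and the three search steps can be sketched in memory as follows. This is an illustration under stated assumptions, not the authors' implementation: the closest reference objects are found by a brute-force scan rather than an M-Tree, the ranking estimate is assumed to be the triangle-inequality lower bound max |d(q, RO) - d(o, RO)| (the slides do not specify the estimator), and all names and the random 2-D data are hypothetical.

```python
import heapq
import math
import random


def dist(x, y):
    """Euclidean distance (requires Python 3.8+ for math.dist)."""
    return math.dist(x, y)


def build_index(objects, pivots, m):
    """Keep only the m closest reference objects per data object.

    Returns {object_id: {pivot_id: distance}}, i.e. m * n stored distances.
    """
    index = {}
    for oid, obj in enumerate(objects):
        nearest = heapq.nsmallest(
            m, ((dist(obj, p), pid) for pid, p in enumerate(pivots))
        )
        index[oid] = {pid: d for d, pid in nearest}
    return index


def search(query, pivots, index, s, k):
    """Approximate search following steps (1) to (3) of the slide."""
    # (1) s nearest reference objects to the query.
    q_near = heapq.nsmallest(
        s, ((dist(query, p), pid) for pid, p in enumerate(pivots))
    )
    candidates = []
    for oid, signature in index.items():
        # (2) filtering: all s reference objects must be in the signature.
        if all(pid in signature for _, pid in q_near):
            # (3) assumed estimate: lower bound on d(q, o).
            estimate = max(abs(dq - signature[pid]) for dq, pid in q_near)
            candidates.append((estimate, oid))
    return [oid for _, oid in heapq.nsmallest(k, candidates)]


# Hypothetical data: 200 random 2-D points, 20 of them used as pivots.
rng = random.Random(0)
objects = [(rng.random(), rng.random()) for _ in range(200)]
pivots = rng.sample(objects, 20)
index = build_index(objects, pivots, m=4)
result = search(objects[7], pivots, index, s=2, k=5)
```

Querying with an indexed object returns that object first, because its estimate is zero; other candidates are only approximately ordered, which is the approximation the structure trades for scalability.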

8 Locality Phenomenon Outline. [Figure: three panels over data objects o1 to o3, reference objects R1 to R6, and a query q with radius r; panel titles: Assignment, RO cells, Filtering with 2 ROs.] Arrows in the assignment panel denote the three closest ROs of each object. Different colors denote different cells and their intersection ($s = 2$: $R_2$, $R_3$). The cell intersection prunes more effectively than simple distance pruning! Locality phenomenon: also use the distances of $q$ to its closest ROs to filter out more candidates. An object $o$ can be pruned if $|d(q, RO) - d(RO, o)| > r$. An object $o$ cannot be pruned if $d(q, RO) - r < d(RO, o) < d(q, RO) + r$. An object must lie in the intersection.
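The pruning condition follows from the triangle inequality: if the distances of q and o to the same reference object differ by more than r, then d(q, o) > r. A small sketch of the rule, with hypothetical function names and example distances:

```python
def can_prune(d_q_ro, d_ro_o, r):
    """True if o cannot lie within radius r of q, judged by one reference object."""
    return abs(d_q_ro - d_ro_o) > r


def survives_all(d_q_ros, d_o_ros, r):
    """True if o lies in the intersection of the rings around all given ROs.

    d_q_ros[i] and d_o_ros[i] are the distances of q and o to the same RO.
    """
    return all(not can_prune(dq, do, r) for dq, do in zip(d_q_ros, d_o_ros))


# Illustrative values: query radius 1.0, d(q, RO) = 5.0.
pruned = can_prune(5.0, 6.5, 1.0)          # ring is (4.0, 6.0); 6.5 is outside
kept = can_prune(5.0, 5.5, 1.0)            # 5.5 is inside the ring
in_intersection = survives_all([5.0, 3.0], [5.5, 3.9], 1.0)
out_of_intersection = survives_all([5.0, 3.0], [5.5, 4.5], 1.0)
```

An object that survives every ring is only a candidate; its true distance to q still has to be verified or estimated.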

9 Access Structure Implementation: Small Scale. The access structure is implemented as a relational database. There is one table per reference object, storing the object ID and the distance from the RO. Indices are kept on object IDs and distances. Filtering (pruning) is an inner join of the tables of the $s$ closest reference objects, restricted to predefined distance ranges. Candidates are verified by computing the original distance after retrieving the object representations. Tables and object representations are kept on disk (separately).
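The join-based filtering can be sketched with Python's built-in `sqlite3` module. The slides do not name the database engine or the schema, so the table names (`ro_2`, `ro_3`), column names, and sample rows below are assumptions; the distance range per table is taken as [d(q, RO) - r, d(q, RO) + r] with inclusive bounds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per reference object, indexed on object ID and on distance.
for ro in (2, 3):
    conn.execute(f"CREATE TABLE ro_{ro} (object_id INTEGER, distance REAL)")
    conn.execute(f"CREATE INDEX ro_{ro}_oid ON ro_{ro} (object_id)")
    conn.execute(f"CREATE INDEX ro_{ro}_dist ON ro_{ro} (distance)")

# Hypothetical rows: (object_id, d(o, RO)).
conn.executemany("INSERT INTO ro_2 VALUES (?, ?)", [(1, 1.0), (2, 2.5), (3, 4.0)])
conn.executemany("INSERT INTO ro_3 VALUES (?, ?)", [(1, 3.0), (2, 2.0), (4, 1.0)])


def filter_candidates(conn, q_dists, r):
    """Inner-join the tables of the s closest ROs within the distance ranges.

    q_dists maps an RO id to d(q, RO) for each of the s closest ROs.
    """
    ros = sorted(q_dists)
    first = ros[0]
    sql = f"SELECT ro_{first}.object_id FROM ro_{first}"
    for ro in ros[1:]:
        sql += f" INNER JOIN ro_{ro} ON ro_{ro}.object_id = ro_{first}.object_id"
    sql += " WHERE " + " AND ".join(
        f"ro_{ro}.distance BETWEEN ? AND ?" for ro in ros
    )
    params = []
    for ro in ros:
        params += [q_dists[ro] - r, q_dists[ro] + r]
    return sorted(row[0] for row in conn.execute(sql, params))


wide = filter_candidates(conn, {2: 2.0, 3: 2.5}, r=1.0)
narrow = filter_candidates(conn, {2: 2.0, 3: 2.5}, r=0.5)
```

Objects 3 and 4 never qualify because each appears in only one of the two tables, which is the cell-intersection pruning expressed as a join.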

10 Parameter Setting. On different data, one access structure performs differently. On the same data, one access structure performs differently under different parameter settings. There is a tight correlation between the indexability of data and its intrinsic dimension. How to select the correct number of reference objects? $|P| = \frac{m \cdot n}{O}$, where $O$ is the estimated average number of objects per table and $m$ corresponds to the data's intrinsic dimensionality.
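One way to read the sizing relation on this slide is |P| = m * n / O: each of the n objects is stored in m tables, so m * n entries are spread over |P| tables holding O objects on average. The sketch below works through that reading; the function name and the input numbers are hypothetical and do not come from the experiments.

```python
import math


def reference_object_count(n, m, avg_objects_per_table):
    """Number of reference objects so that each table holds about
    `avg_objects_per_table` entries when every object is stored m times."""
    return math.ceil(m * n / avg_objects_per_table)


# Hypothetical collection: 1,000,000 objects, m = 8, about 4,000 objects per table.
pivot_count = reference_object_count(n=1_000_000, m=8, avg_objects_per_table=4_000)
```

Under this reading, a larger target table size O means fewer reference objects and coarser cells, while a smaller O means more reference objects and more distance computations at build time.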

11 Experimental Evaluation. 30M SIFTs. [Figure: three plots over x * r for the SIFT data: recall and precision; time and read; candidates and GT.] s = 3, P = 3, ,000, , hours to create the index (two parallel processes on one machine). Relational database as the storage engine: 240M rows, 36 GB on disk.

12 Going for the Really Large Scale (Web). The previous evaluation uncovered bottlenecks in the implementation: a slow build-up phase ($30{,}000{,}000 \times 3{,}500$ distance computations, 30, 000, inserts) and limited scalability in terms of storage (size, throughput). The concepts exploited by the proposed access structure allow massive parallelization: both indexing and search can be parallelized, and the storage itself can be parallelized. The storage needs to be altered to reduce the number of inserts. Framework for large-scale data operations: MapReduce and Google FS. Framework for large-scale data storage: Bigtable.

13 Google's MapReduce. A simplified framework for data processing on large clusters (of commodity hardware). It transforms data from one form to another in three phases; the data are represented as key-value pairs $\langle K, V \rangle$. Fail-safety is achieved via data/process replication, speed via data-local computations. Primary data storage is the Google File System: replicated and distributed blocks of data (splits, e.g. 64 MB). Map takes $\langle K_1, V_1 \rangle$ and transforms it into $\langle K_2, V_2 \rangle$ pairs. Shuffle/Sort takes $\langle K_2, V_2 \rangle$ and transforms it into $\langle K_2, list(V_2) \rangle$. Reduce takes $\langle K_2, list(V_2) \rangle$ and emits $\langle K_3, V_3 \rangle$.
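The three phases can be imitated in a single process to make the key-value flow concrete. This is a toy sketch, not Hadoop's API: `run_mapreduce`, the mapper, the reducer, and the one-dimensional sample data are all hypothetical. The example job assigns each object to its closest reference object and collects the members of each cell.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run Map, Shuffle/Sort, and Reduce sequentially in memory."""
    # Map: <K1, V1> -> list of <K2, V2>.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # Shuffle/Sort: group values by key, <K2, list(V2)>.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce: <K2, list(V2)> -> list of <K3, V3>, keys visited in sorted order.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


# Hypothetical one-dimensional reference objects and data objects.
REFERENCE_OBJECTS = {"R1": 0.0, "R2": 10.0}


def nearest_ro_mapper(object_id, value):
    """Emit <closest RO, object id> for one data object."""
    ro = min(REFERENCE_OBJECTS, key=lambda r: abs(value - REFERENCE_OBJECTS[r]))
    yield ro, object_id


def cell_reducer(ro, object_ids):
    """Emit <RO, sorted member list> for one cell."""
    yield ro, sorted(object_ids)


records = [("o1", 1.0), ("o2", 9.0), ("o3", 4.0)]
cells = run_mapreduce(records, nearest_ro_mapper, cell_reducer)
```

In a real cluster the map calls run in parallel next to the data splits, and the shuffle moves the intermediate pairs between machines; the sketch keeps only the data flow.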

14 Enhanced Distributed Storage: Bigtable. "A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes." [Google Inc.] Sparse = variable number of columns per row, any value can be null; distributed = stored on a cluster of computers; persistent = materialized; sorted map = sorted by unique row ID, no secondary indices available. By no means a replacement for relational databases. Stored as files in GFS, so high availability comes via replication. Extreme scalability with assured performance: petabytes of data on thousands of computers. row id → list(qualifier → cell value).

15 Hadoop and HBase. Both Google frameworks have open-source implementations: MapReduce is called Hadoop, Google FS is Hadoop's Distributed FS (HDFS), and Bigtable is HBase. All are implemented in Java. HDFS works as an overlay on an already installed file system, so no specialized OS installation is needed.

16 Proposed Disk-based Access Structure in Hadoop/HBase. The index is stored in HBase as one big distributed table (map). The index build-up process is rewritten as a MapReduce job: it takes object representations as input and feeds the HBase table as output. It uses an M-Tree to find the $m$ closest reference objects, which reduces the total number of distance computations. The query processing algorithm is also rewritten as a MapReduce job. Due to the change of structure, no joins are necessary. In the map phase, the targeted HBase splits are read, the original distance is estimated, and ranked candidates are emitted. Because of the job start-up overhead, several queries are processed in one batch.

17 HBase Storage Outline. row id → list(qualifier → cell value). Data objects (originally 128-dimensional vectors) are ordered by their closest reference object and the distance to it. A row id is the reference object id + distance. A data object's signature is a set of pairs: reference object id and $d(o, RO)$. A qualifier is the data object id + RO id; the value is $d(o, RO)$. The whole signature is stored on one row. Recall: the table is ordered by reference object id and distance (the index), and columns are ordered by object id.
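Because the table is sorted by row id, a row id built from the reference object id followed by the distance turns a distance range around one reference object into a contiguous scan. The sketch below illustrates the idea with a sorted Python list standing in for the table. The byte layout is an assumption (the slides do not give the encoding): a 4-byte big-endian RO id followed by an 8-byte big-endian IEEE 754 double, which sorts correctly as raw bytes for non-negative distances.

```python
import bisect
import struct


def row_key(ro_id, distance):
    """Encode (RO id, distance) so that byte order equals (id, distance) order.

    Assumed layout: unsigned 32-bit id + double, both big-endian; valid for
    non-negative distances only.
    """
    return struct.pack(">Id", ro_id, distance)


def scan(sorted_keys, ro_id, low, high):
    """Return the (RO id, distance) rows with low <= distance <= high."""
    low = low if low > 0 else 0.0  # distances are never negative
    start = bisect.bisect_left(sorted_keys, row_key(ro_id, low))
    stop = bisect.bisect_right(sorted_keys, row_key(ro_id, high))
    return [struct.unpack(">Id", key) for key in sorted_keys[start:stop]]


# Hypothetical rows: (closest RO id, distance to it).
rows = [(2, 2.2), (1, 3.5), (1, 0.5), (2, 0.1), (1, 2.0)]
table = sorted(row_key(ro, d) for ro, d in rows)

# Range around RO 1 for d(q, RO) = 2.25 and r = 1.25, i.e. [1.0, 3.5].
hits = scan(table, ro_id=1, low=1.0, high=3.5)
```

Only the rows inside the range are touched, which is why the query job needs to read just the targeted splits instead of joining per-reference-object tables.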

18 Indexing Structure Build-Up Phase. A data set of 2 billion SIFT descriptors was used: 250 GB of binary descriptor data. 15 machines were used, 12 of them as workers; each has 2 quad-core Xeon processors at 2933 MHz, 24 GB RAM, and 350 GB on a 15K rpm disk. $|P| = 20{,}000$ random reference objects were selected; $m = 8$ distances to reference objects are stored. Using MapReduce, it took 50 hours to create the indexing structure, with 8 worker slots per machine = 96 worker slots in the cluster. The HBase data consumes 144 GB in a compressed table with 750 million rows.

19 Query Batch Processing: Absolute Time Results. [Figure: data objects checked and candidates emitted over time (s).] 10 queries, s = 3; 30 HBase splits to scan; 2,977,925 data objects retrieved from HBase, 213,761 candidates emitted. The framework initialization overhead is clearly visible. There is no reduce phase: each mapper emits already ranked candidates (emitted almost immediately after the scan starts).

20 Relative Time of Candidate Emission to Scan Start. [Figure: candidates emitted over time (s).] 10 queries, s = 3; 30 HBase splits to scan. Most of the candidates are emitted within 4 seconds of the start of scanning. Further analysis is needed to find out why some tasks take considerably longer than others (a difference of tens of seconds).

21 Conclusions. The index build-up phase is completely parallelized and scales linearly. The framework's overhead is a bottleneck for real deployment (MapReduce vs. distributed database: job kept alive). Further optimizations are possible to reduce the candidate emission times (e.g. aggressive pruning, HBase tuning, etc.). Smaller radii mean faster processing: 20,000 candidates with 0.5 recall = 10,000 true positives, whereas real-life applications usually demand only hundred(s).

22 Questions? Thank you for your attention. Supported by: WISDOM, the French federation ( ); NEUMA, the French project ANR CONTINT ( ); METACentrum, (super)computing facilities provided under the Czech research intent MSM