Learning to Hash for Big Data


Learning to Hash for Big Data
Wu-Jun Li
LAMDA Group, Department of Computer Science and Technology, Nanjing University
Email: liwujun@nju.edu.cn
May 09, 2015

Outline
1. Introduction: Problem Definition; Existing Methods
2. Scalable Graph Hashing with Feature Transformation: Motivation; Model and Learning; Experiment
3. Conclusion
4. Reference


Introduction / Problem Definition: Nearest Neighbor Search (Retrieval)
- Given a query point q, return the points closest (most similar) to q in the database (e.g., images).
- Underlying many machine learning, data mining, and information retrieval problems.
- Challenges in big data applications: curse of dimensionality, storage cost, query speed.

Introduction / Problem Definition: Similarity Preserving Hashing

Introduction / Problem Definition: Reduce Dimensionality and Storage Cost

Introduction / Problem Definition: Querying
Queries are compared to database items by Hamming distance: the number of bit positions in which two binary codes differ (e.g., two codes that differ in exactly one bit have Hamming distance 1).
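The concrete example codes on this slide did not survive transcription; the following minimal sketch (with hypothetical codes) just illustrates how the Hamming distance between two binary codes can be computed, both directly and via the XOR/popcount form typically used in practice.

```python
import numpy as np

# Two example 8-bit codes stored as +/-1 vectors (hypothetical values).
b_i = np.array([1, -1, 1,  1, -1, -1, 1, -1])
b_j = np.array([1, -1, 1, -1, -1, -1, 1, -1])

# Hamming distance = number of positions where the codes differ.
hamming = int(np.sum(b_i != b_j))                       # -> 1

# Equivalent bit-level form: pack to integers and popcount the XOR.
u = int("".join("1" if x > 0 else "0" for x in b_i), 2)
v = int("".join("1" if x > 0 else "0" for x in b_j), 2)
assert bin(u ^ v).count("1") == hamming
```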

Introduction / Problem Definition: Fast Query Speed
- By using a hashing-based index, we can achieve constant or sub-linear search time.
- Exhaustive search also becomes acceptable, because computing distances between binary codes is now cheap.
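As a rough illustration of the constant-time lookup claim, here is a hedged sketch (with made-up codes) of a hash-table index that buckets database points by their exact binary code; real systems additionally probe nearby buckets or use multiple tables.

```python
from collections import defaultdict
import numpy as np

# Toy database of six 4-bit codes (how such codes are learned comes later).
codes = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [0, 1, 1, 1],
                  [1, 0, 0, 1],
                  [1, 1, 0, 0]], dtype=np.uint8)

# Hash-table index: binary code -> list of database ids sharing that code.
table = defaultdict(list)
for idx, code in enumerate(codes):
    table[code.tobytes()].append(idx)

query_code = np.array([0, 1, 1, 0], dtype=np.uint8)
candidates = table[query_code.tobytes()]   # -> [0, 1], retrieved in O(1)
```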

Introduction / Problem Definition: Hash Function Learning
Easy or hard?
- Hard: it is a discrete optimization problem.
- Easy by approximation: two stages of hash function learning.
  - Projection stage (dimensionality reduction): given a point x, each projected dimension i is associated with a real-valued projection function $f_i(x)$ (e.g., $f_i(x) = w_i^T x$).
  - Quantization stage: turn the real values into binary bits.
- However, there exist essential differences between metric learning (dimensionality reduction) and learning to hash; simply adapting traditional metric learning is not enough.
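A minimal sketch of the two-stage view, with a random linear projection standing in for whatever projection is actually learned (with a random W this is essentially the random-projection LSH of the next slide); all parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 16, 8                        # input dimensionality, code length

# Projection stage: one projection vector w_i per bit, f_i(x) = w_i^T x.
W = rng.standard_normal((c, d))     # random here; learned in data-dependent methods

def hash_codes(X, W):
    projections = X @ W.T                         # real-valued projections (n, c)
    # Quantization stage: real -> binary (0/1 bits; the slides use the
    # equivalent +/-1 encoding).
    return (projections >= 0).astype(np.uint8)

X = rng.standard_normal((1000, d))  # toy database
codes = hash_codes(X, W)            # (1000, 8) binary codes
```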

Introduction / Existing Methods: Data-Independent Methods
The hash function family is defined independently of the training dataset:
- Locality-sensitive hashing (LSH) (Gionis et al., 1999; Andoni and Indyk, 2008) and its extensions (Datar et al., 2004; Kulis and Grauman, 2009; Kulis et al., 2009).
- SIKH: shift-invariant kernel hashing (Raginsky and Lazebnik, 2009).
Hash functions: random projections.

Introduction / Existing Methods: Data-Dependent Methods
Hash functions are learned from a given training dataset (learning to hash), yielding relatively short codes.
Seminal papers: (Salakhutdinov and Hinton, 2007, 2009; Torralba et al., 2008; Weiss et al., 2008).
Two categories:
- Unimodal: supervised methods (given labels $y_i$ or triplets $(x_i, x_j, x_k)$) and unsupervised methods.
- Multimodal: supervised methods and unsupervised methods.

Introduction / Existing Methods: (Unimodal) Unsupervised Methods
No labels are given to denote the categories of the training points.
- PCAH: hashing based on principal component analysis.
- SH: spectral hashing, using eigenfunctions computed from the data similarity graph (Weiss et al., 2008).
- ITQ: learns an orthogonal rotation matrix to refine the initial projection matrix learned by PCA (Gong and Lazebnik, 2011).
- AGH: anchor-graph-based hashing (Liu et al., 2011).
- IsoHash: learns projected dimensions with isotropic variances (Kong and Li, 2012b).
- DGH: discrete graph hashing (Liu et al., 2014), etc.

Introduction / Existing Methods: (Unimodal) Supervised (Semi-Supervised) Methods
Methods using class labels or pairwise constraints:
- SSH: semi-supervised hashing, which exploits both labeled and unlabeled data for hash function learning (Wang et al., 2010a,b).
- MLH: minimal loss hashing, based on the latent structural SVM framework (Norouzi and Fleet, 2011).
- KSH: kernel-based supervised hashing (Liu et al., 2012).
- LDAHash: linear discriminant analysis based hashing (Strecha et al., 2012).
- LFH: supervised hashing with latent factor models (Zhang et al., 2014), etc.
Triplet-based methods:
- Hamming distance metric learning (HDML) (Norouzi et al., 2012).
- Column generation based hashing (CGHash) (Li et al., 2013).

Introduction / Existing Methods: Multimodal Methods
- Multi-source hashing
- Cross-modal hashing

Introduction / Existing Methods: Multi-Source Hashing
- Aims to learn better codes than unimodal hashing by leveraging auxiliary views.
- Assumes that all views are provided for a query, which is typically not feasible in many multimedia applications.
- Multiple feature hashing (Song et al., 2011), composite hashing (Zhang et al., 2011), etc.

Introduction / Existing Methods: Cross-Modal Hashing
Given a query of either image or text, return images or texts similar to it.
- Cross-View Hashing (CVH) (Kumar and Udupa, 2011)
- Multimodal Latent Binary Embedding (MLBE) (Zhen and Yeung, 2012a)
- Co-Regularized Hashing (CRH) (Zhen and Yeung, 2012b)
- Inter-Media Hashing (IMH) (Song et al., 2013)
- Relation-aware Heterogeneous Hashing (RaHH) (Ou et al., 2013)
- Semantic Correlation Maximization (SCM) (Zhang and Li, 2014), etc.

Introduction / Existing Methods: Our Contribution
- Unsupervised hashing: (1) Isotropic hashing [NIPS 2012]; (2) Scalable graph hashing with feature transformation [IJCAI 2015].
- Supervised hashing [SIGIR 2014]: supervised hashing with latent factor models.
- Multimodal hashing [AAAI 2014]: large-scale supervised multimodal hashing with semantic correlation maximization.
- Multiple-bit quantization: (1) Double-bit quantization [AAAI 2012]; (2) Manhattan quantization [SIGIR 2012].
- Survey on learning to hash for big data: current status and future trends [Chinese Science Bulletin, 2015].


Scalable Graph Hashing with Feature Transformation / Motivation: (Unsupervised) Graph Hashing
Guide the hash code learning procedure by directly exploiting the pairwise similarity (neighborhood structure):
$$\min_B \sum_{ij} S_{ij}\, \|b_i - b_j\|^2 = \mathrm{tr}(B^T L B)$$
$$\text{subject to: } b_i \in \{-1, 1\}^c, \quad \sum_i b_i = 0, \quad \frac{1}{n}\sum_i b_i b_i^T = I$$
Graph hashing should be expected to achieve better performance than non-graph-based methods if the learning algorithms are effective.

Scalable Graph Hashing with Feature Transformation / Motivation
However, it is difficult to design effective graph hashing algorithms, because both the memory and time complexity are at least O(n^2) if the whole similarity graph is used.
- SH (Weiss et al., 2008): uses an eigenfunction solution of the 1-D Laplacian under a uniform-distribution assumption.
- BRE (Kulis and Darrell, 2009): subsamples a small subset for training.
- AGH (Liu et al., 2011) and DGH (Liu et al., 2014): use an anchor graph to approximate the similarity graph.

Scalable Graph Hashing with Feature Transformation / Motivation: Contribution
How can we utilize the whole graph while avoiding O(n^2) complexity?
Scalable graph hashing (SGH):
- A feature transformation method (Shrivastava and Li, 2014) to effectively approximate the whole graph without explicitly computing it.
- A sequential method for bit-wise complementary learning.
- Linear time complexity.
- Outperforms the state of the art in terms of both accuracy and scalability.

Scalable Graph Hashing with Feature Transformation / Model and Learning: Notation
- $X = \{x_1, \dots, x_n\}^T \in \mathbb{R}^{n \times d}$: n data points.
- $b_i \in \{+1, -1\}^c$: binary code of point $x_i$, with $b_i = [h_1(x_i), \dots, h_c(x_i)]^T$, where $h_k(\cdot)$ denotes the k-th hash function.
- Pairwise similarity metric: $S_{ij} = e^{-\|x_i - x_j\|_F^2 / \rho} \in (0, 1]$.

Scalable Graph Hashing with Feature Transformation / Model and Learning: Objective Function
$$\min \sum_{ij} \Big(\widetilde{S}_{ij} - \frac{1}{c}\, b_i^T b_j\Big)^2, \qquad \widetilde{S}_{ij} = 2S_{ij} - 1 \in (-1, 1]$$
Hash function: $h_k(x_i) = \mathrm{sgn}\big(\sum_{j=1}^m W_{kj}\, \phi(x_i, x_j) + \mathrm{bias}\big)$.
For any $x_i$, map it into the Hamming space as $b_i = [h_1(x_i), \dots, h_c(x_i)]^T$.
For any $x$, define
$$K(x) = \Big[\phi(x, x_1) - \tfrac{1}{n}\textstyle\sum_{i=1}^n \phi(x_i, x_1),\; \dots,\; \phi(x, x_m) - \tfrac{1}{n}\textstyle\sum_{i=1}^n \phi(x_i, x_m)\Big]$$
Objective function with the parameter $W \in \mathbb{R}^{c \times m}$:
$$\min_W \; \big\| c\widetilde{S} - \mathrm{sgn}(K(X)W^T)\,\mathrm{sgn}(K(X)W^T)^T \big\|_F^2 \quad \text{s.t. } W K(X)^T K(X) W^T = I$$
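A hedged sketch of the hash-function form above: kernel features K(X) computed against m basis points and centered as in the definition of K(x), then codes sgn(K(X)W^T). The Gaussian kernel, the random sampling of basis points, and the parameter values are illustrative assumptions; W is what the learning procedure on the following slides actually optimizes.

```python
import numpy as np

def kernel_features(X, bases, gamma):
    # phi(x, x_j) for every basis point x_j, assuming a Gaussian (RBF) kernel.
    sq = ((X[:, None, :] - bases[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-gamma * sq)                        # (n, m)
    # Center each column by its mean over the training set, as in K(x).
    return Phi - Phi.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))                      # toy training set
bases = X[rng.choice(len(X), size=50, replace=False)]   # m = 50 kernel bases (assumption)
K = kernel_features(X, bases, gamma=0.5)

c = 16
W = rng.standard_normal((c, bases.shape[0]))         # placeholder; to be learned
B = np.sign(K @ W.T)                                 # codes b_i = sgn(K(x_i) W^T)
```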

Scalable Graph Hashing with Feature Transformation / Model and Learning: Feature Transformation
For any $x$, define $P(x)$ and $Q(x)$:
$$P(x) = \Big[\sqrt{\tfrac{2(e^2-1)}{e\rho}}\, e^{-\frac{\|x\|_F^2}{\rho}}\, x;\;\; \sqrt{\tfrac{e^2+1}{e}}\, e^{-\frac{\|x\|_F^2}{\rho}};\;\; 1\Big]$$
$$Q(x) = \Big[\sqrt{\tfrac{2(e^2-1)}{e\rho}}\, e^{-\frac{\|x\|_F^2}{\rho}}\, x;\;\; \sqrt{\tfrac{e^2+1}{e}}\, e^{-\frac{\|x\|_F^2}{\rho}};\;\; -1\Big]$$
Then for any $x_i, x_j \in X$:
$$P(x_i)^T Q(x_j) = 2\Big[\tfrac{e^2-1}{2e}\cdot\tfrac{2 x_i^T x_j}{\rho} + \tfrac{e^2+1}{2e}\Big] e^{-\frac{\|x_i\|_F^2 + \|x_j\|_F^2}{\rho}} - 1 \approx 2 e^{-\frac{\|x_i - x_j\|_F^2}{\rho}} - 1 = \widetilde{S}_{ij}$$
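A small sanity-check sketch of this transformation (the data are random and ρ follows the rule on the next slide): it builds P(X) and Q(X) as above and compares the inner products P(x_i)^T Q(x_j) with the exact 2S_ij - 1.

```python
import numpy as np

def transform(X, rho):
    """Rows are P(x_i)^T and Q(x_i)^T as defined on this slide."""
    e = np.e
    sqnorm = (X ** 2).sum(axis=1, keepdims=True)             # ||x||_F^2 per point
    decay = np.exp(-sqnorm / rho)
    first = np.sqrt(2 * (e ** 2 - 1) / (e * rho)) * decay * X
    mid = np.sqrt((e ** 2 + 1) / e) * decay
    ones = np.ones_like(mid)
    return np.hstack([first, mid, ones]), np.hstack([first, mid, -ones])

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
rho = 2 * (X ** 2).sum(axis=1).max()          # rho = 2 * max_i ||x_i||_F^2

P, Q = transform(X, rho)
approx = P @ Q.T                              # approximates 2S - 1 without forming S
exact = 2 * np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / rho) - 1
print(np.abs(approx - exact).max())           # deviation of the linearized form
```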

Scalable Graph Hashing with Feature Transformation / Model and Learning: Feature Transformation
Here we use the approximation
$$e^x \approx \frac{e^2 - 1}{2e}\, x + \frac{e^2 + 1}{2e}, \qquad x \in [-1, 1].$$
We assume $-1 \le \frac{2}{\rho} x_i^T x_j \le 1$; it is easy to prove that $\rho = 2 \max\{\|x_i\|_F^2\}_{i=1}^n$ guarantees this.
Then we have $\widetilde{S} \approx P(X)^T Q(X)$.
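A quick numerical check of this linear approximation of e^x on [-1, 1] (illustrative values only): it is exact at the endpoints and loosest around x = 0.

```python
import numpy as np

e = np.e
x = np.linspace(-1.0, 1.0, 5)
approx = (e ** 2 - 1) / (2 * e) * x + (e ** 2 + 1) / (2 * e)
print(np.column_stack([x, np.exp(x), approx]))
# Endpoints x = -1 and x = 1 are matched exactly (1/e and e);
# the largest gap is at x = 0 (exp(0) = 1 vs. about 1.543).
```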

Scalable Graph Hashing with Feature Transformation / Model and Learning: Sequential Learning Strategy
Direct relaxation may lead to poor performance, so we adopt a sequential learning strategy in a bit-wise complementary manner.
Residual definition:
$$R_t = c\widetilde{S} - \sum_{i=1}^{t-1} \mathrm{sgn}(K(X)w_i)\,\mathrm{sgn}(K(X)w_i)^T, \qquad R_1 = c\widetilde{S}$$
Objective function for the t-th bit:
$$\min_{w_t} \; \big\| R_t - \mathrm{sgn}(K(X)w_t)\,\mathrm{sgn}(K(X)w_t)^T \big\|_F^2 \quad \text{s.t. } w_t^T K(X)^T K(X) w_t = 1$$
By relaxing $\mathrm{sgn}(K(X)w_t)$ to $K(X)w_t$, we can get:
$$\min_{w_t} \; -\mathrm{tr}\big(w_t^T K(X)^T R_t K(X) w_t\big) \quad \text{s.t. } w_t^T K(X)^T K(X) w_t = 1$$

Scalable Graph Hashing with Feature Transformation / Model and Learning: Sequential Learning Strategy
Then we obtain a generalized eigenvalue problem:
$$K(X)^T R_t K(X)\, w_t = \lambda\, K(X)^T K(X)\, w_t$$
Define $A_t = K(X)^T R_t K(X) \in \mathbb{R}^{m \times m}$; then:
$$A_t = A_{t-1} - K(X)^T \mathrm{sgn}(K(X)w_{t-1})\,\mathrm{sgn}(K(X)w_{t-1})^T K(X)$$
Key component, computed without explicitly forming the n x n matrix $\widetilde{S}$ (using $\widetilde{S} \approx P(X)^T Q(X)$):
$$A_1 = c\,K(X)^T \widetilde{S}\, K(X) = c\,\big[K(X)^T P(X)^T\big]\,\big[Q(X)\, K(X)\big]$$
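Putting the pieces together, a hedged sketch of the sequential bit-wise loop, reusing K from the kernel-feature sketch and P, Q from the transformation sketch above; the generalized eigenproblem is solved with scipy, the small ridge term is an assumption for numerical stability, and the extra refinement rounds of the next slide are omitted.

```python
import numpy as np
from scipy.linalg import eigh

def learn_bit(KtK, A_t):
    # Relaxed per-bit problem: top eigenvector of A_t w = lambda (K^T K) w.
    vals, vecs = eigh(A_t, KtK)                  # ascending generalized eigenvalues
    return vecs[:, -1]

def sgh_sequential(K, P, Q, c, reg=1e-6):
    """Sequential learning sketch: K is (n, m); P, Q are (n, d+2)."""
    KtK = K.T @ K + reg * np.eye(K.shape[1])     # ridge term is an assumption
    # A_1 = c [K^T P(X)^T][Q(X) K]: never forms the n x n similarity matrix.
    A = c * (K.T @ P) @ (Q.T @ K)
    W = []
    for _ in range(c):
        w = learn_bit(KtK, A)
        b = np.sign(K @ w)                       # sgn(K(X) w_t)
        A = A - np.outer(K.T @ b, b @ K)         # residual update for the next bit
        W.append(w)
    return np.array(W)                           # (c, m), rows are w_t
```

Calling sgh_sequential(K, P, Q, c=16) with the K, P, and Q from the earlier sketches yields a projection matrix W from which codes are generated as sgn(K(X)W^T).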

Scalable Graph Hashing with Feature Transformation / Model and Learning: Sequential Learning Strategy
By adopting the residual matrix
$$R_t = c\widetilde{S} - \sum_{i=1,\, i \ne t}^{c} \mathrm{sgn}(K(X)w_i)\,\mathrm{sgn}(K(X)w_i)^T,$$
we can re-learn all of $W = \{w_i\}_{i=1}^c$ for multiple rounds, which can further improve accuracy. We continue for one more round to get a good tradeoff between accuracy and speed.

Scalable Graph Hashing with Feature Transformation / Model and Learning: Sequential Learning Algorithm

Scalable Graph Hashing with Feature Transformation / Model and Learning: Complexity Analysis
Initialization:
- P(X) and Q(X): O(dn)
- Kernel initialization: O(dmn + mn)
- A_1 and Z: O((d + 2)mn)
Main procedure: O(c(mn + m^2) + m^3)
Typically m, d, and c are much smaller than n, so the time complexity is O(n); furthermore, the storage complexity is also O(n).

Scalable Graph Hashing with Feature Transformation / Experiment: Datasets
- TINY-1M: one million images from the set of 80M tiny images; each tiny image is represented by a 384-dim GIST descriptor.
- MIRFLICKR-1M: one million Flickr images; 512-dim features are extracted from each image.

Scalable Graph Hashing with Feature Transformation / Experiment: Evaluation Protocols and Baselines
Evaluation protocols:
- Ground truth: the top 2% nearest neighbors in the training set in terms of Euclidean distance.
- Hamming ranking: Top-K precision.
- Training time, for scalability.
Baselines:
- LSH (Andoni and Indyk, 2008)
- PCAH (Gong and Lazebnik, 2011)
- ITQ (Gong and Lazebnik, 2011)
- AGH (Liu et al., 2011)
- DGH (Liu et al., 2014): DGH-I and DGH-R
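A minimal sketch of the Top-K precision protocol under Hamming ranking; the ground-truth neighbor sets (e.g., each query's top 2% Euclidean neighbors) are assumed to be precomputed, and all names here are illustrative.

```python
import numpy as np

def topk_precision(query_codes, db_codes, gt_neighbors, k):
    """Mean Top-K precision: rank the database by Hamming distance to each
    query code and count how many of the top k are true neighbors."""
    precisions = []
    for i, q in enumerate(query_codes):
        hamming = np.count_nonzero(db_codes != q, axis=1)   # distance to every item
        topk = np.argsort(hamming, kind="stable")[:k]       # Hamming ranking
        hits = len(set(topk.tolist()) & set(gt_neighbors[i]))
        precisions.append(hits / k)
    return float(np.mean(precisions))
```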

Scalable Graph Hashing with Feature Transformation / Experiment: Top-1K Precision
[Tables: Top-1000 precision of SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH with 32, 64, 96, 128, and 256 bits on the two datasets; the numeric values were not preserved in this transcription.]

Scalable Graph Hashing with Feature Transformation / Experiment: Top-K Precision (TINY-1M)
[Figure: Top-K precision vs. number of returned samples with (a) 64-bit and (b) 128-bit codes, comparing SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH.]

Scalable Graph Hashing with Feature Transformation / Experiment: Top-K Precision (MIRFLICKR-1M)
[Figure: Top-K precision vs. number of returned samples with (a) 64-bit and (b) 128-bit codes, comparing SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH.]

Scalable Graph Hashing with Feature Transformation / Experiment: Training Time (in seconds)
[Tables: training time of SGH, ITQ, AGH, DGH-I, DGH-R, PCAH, and LSH with 32, 64, 96, 128, and 256 bits on the two datasets; the entries for AGH, DGH-I, and DGH-R include an additional common term (t1 on the first dataset, t2 on the second). The numeric values, including t1 and t2, were not preserved in this transcription.]

Scalable Graph Hashing with Feature Transformation / Experiment: Sensitivity to Parameter ρ
[Figure: performance as a function of ρ on (a) TINY-1M and (b) MIRFLICKR-1M.]

Scalable Graph Hashing with Feature Transformation / Experiment: Sensitivity to Parameter m
[Figure: performance as a function of the number of kernel bases m on (a) TINY-1M and (b) MIRFLICKR-1M.]


Conclusion
- Hashing can significantly improve retrieval speed and reduce storage cost; it has become a hot research topic in big learning, with wide applications.
- We have proposed a series of hashing methods, including unsupervised, supervised, and multimodal methods; furthermore, some quantization strategies (Kong and Li, 2012a; Kong et al., 2012) have also been designed.
- In particular, the details of SGH with feature transformation (Jiang and Li, 2015) were introduced in this talk.

Conclusion: Future Trends
- Discrete optimization and learning
- Scalable training for supervised hashing
- Quantization
- New applications
Survey: learning to hash for big data: current status and future trends, Chinese Science Bulletin, 2015 (Li and Zhou, 2015).

Conclusion: Q & A
Thanks! Questions?
Code available at: [link not preserved in this transcription]


References
- A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1), 2008.
- M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the ACM Symposium on Computational Geometry, 2004.
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Data Bases, 1999.
- Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of Computer Vision and Pattern Recognition, 2011.
- Q.-Y. Jiang and W.-J. Li. Scalable graph hashing with feature transformation. In IJCAI, 2015.
- W. Kong and W.-J. Li. Double-bit quantization for hashing. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012a.

- W. Kong and W.-J. Li. Isotropic hashing. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012b.
- W. Kong, W.-J. Li, and M. Guo. Manhattan hashing for large-scale image retrieval. In the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.
- B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of Neural Information Processing Systems, 2009.
- B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proceedings of the International Conference on Computer Vision, 2009.
- B. Kulis, P. Jain, and K. Grauman. Fast similarity search for learned metrics. IEEE Trans. Pattern Anal. Mach. Intell., 31(12), 2009.
- S. Kumar and R. Udupa. Learning hash functions for cross-view similarity search. In IJCAI, 2011.

- W.-J. Li and Z.-H. Zhou. Learning to hash for big data: current status and future trends. Chinese Science Bulletin, 60, 2015.
- X. Li, G. Lin, C. Shen, A. van den Hengel, and A. R. Dick. Learning hash functions using column generation. In ICML, 2013.
- W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In Proceedings of the International Conference on Machine Learning, 2011.
- W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.
- W. Liu, C. Mu, S. Kumar, and S. Chang. Discrete graph hashing. In Proceedings of Neural Information Processing Systems, 2014.
- M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the International Conference on Machine Learning, 2011.
- M. Norouzi, D. J. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, 2012.
- M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, and S. Yang. Comparing apples to oranges: a scalable solution with heterogeneous hashing. In KDD, 2013.

- M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In Proceedings of Neural Information Processing Systems, 2009.
- R. Salakhutdinov and G. Hinton. Semantic hashing. In SIGIR Workshop on Information Retrieval and Applications of Graphical Models, 2007.
- R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7), 2009.
- A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, 2014.
- J. Song, Y. Yang, Z. Huang, H. T. Shen, and R. Hong. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In ACM Multimedia, 2011.
- J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In SIGMOD Conference, 2013.
- C. Strecha, A. A. Bronstein, M. M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 34(1):66-78, 2012.

- A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proceedings of Computer Vision and Pattern Recognition, 2008.
- J. Wang, S. Kumar, and S.-F. Chang. Sequential projection learning for hashing with compact codes. In Proceedings of the International Conference on Machine Learning, 2010a.
- J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale image retrieval. In Proceedings of Computer Vision and Pattern Recognition, 2010b.
- Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proceedings of Neural Information Processing Systems, 2008.
- D. Zhang and W.-J. Li. Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
- D. Zhang, F. Wang, and L. Si. Composite hashing with multiple information sources. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2011.

- P. Zhang, W. Zhang, W.-J. Li, and M. Guo. Supervised hashing with latent factor models. In SIGIR, 2014.
- Y. Zhen and D.-Y. Yeung. A probabilistic model for multimodal hash function learning. In KDD, 2012a.
- Y. Zhen and D.-Y. Yeung. Co-regularized hashing for multimodal data. In NIPS, 2012b.
