Large-Scale Similarity and Distance Metric Learning

Transcription:

Large-Scale Similarity and Distance Metric Learning
Aurélien Bellet, Télécom ParisTech
Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech)
Séminaire Criteo, March 31, 2015

A bit about me
PhD in Computer Science (Dec 2012), Université Jean Monnet, Saint-Etienne. Advisors: Marc Sebban, Amaury Habrard.
Postdoc in 2013-2014 (18 months), University of Southern California, Los Angeles, working with Fei Sha.
Joined Télécom ParisTech in October: Chaire Machine Learning for Big Data, working with Stéphan Clémençon.

Outline of the talk
1. Introduction to Metric Learning
2. Learning (Infinitely) Many Local Metrics
3. Similarity Learning for High-Dimensional Sparse Data
4. Scaling-up ERM for Metric Learning

Introduction

Metric learning: motivation
How do we assign a similarity score to a pair of objects?
It is a basic component of many algorithms: k-NN, clustering, kernel methods, ranking, dimensionality reduction, data visualization. It is crucial to the performance of the above, and obviously problem-dependent.
The machine learning way: let's learn it from data! We need some examples of similar / dissimilar things, and we learn to approximate the latent notion of similarity. This can be thought of as representation learning.

Metric learning: basic recipe
1. Pick a parametric form of distance or similarity function:
(Squared) Mahalanobis distance $d_M^2(x, x') = (x - x')^T M (x - x')$ with $M \in \mathbb{R}^{d \times d}$ symmetric PSD
Bilinear similarity $S_M(x, x') = x^T M x'$ with $M \in \mathbb{R}^{d \times d}$
2. Collect similarity judgments on data pairs/triplets:
$x_i$ and $x_j$ are similar (or dissimilar)
$x_i$ is more similar to $x_j$ than to $x_k$
3. Estimate the parameters such that the metric best satisfies them (convex optimization and the like)
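
To make the two parametric forms concrete, here is a minimal numpy sketch (my own illustration, not from the slides); M is assumed symmetric PSD for the Mahalanobis case, and all variable names are mine.

```python
import numpy as np

def mahalanobis_sq(x, x2, M):
    """Squared Mahalanobis distance d_M^2(x, x') = (x - x')^T M (x - x')."""
    diff = x - x2
    return float(diff @ M @ diff)

def bilinear_similarity(x, x2, M):
    """Bilinear similarity S_M(x, x') = x^T M x'."""
    return float(x @ M @ x2)

rng = np.random.default_rng(0)
d = 5
L = rng.standard_normal((d, d))
M = L @ L.T                                  # symmetric PSD by construction
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(mahalanobis_sq(x, xp, M), bilinear_similarity(x, xp, M))
```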

Metric learning: a famous algorithm, LMNN [Weinberger et al., 2005]
Large Margin Nearest Neighbors:
$$\min_{M \in S_+^d,\ \xi \geq 0} \ (1 - \mu) \sum_{(x_i, x_j) \in S} d_M^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{ijk}$$
$$\text{s.t.} \quad d_M^2(x_i, x_k) - d_M^2(x_i, x_j) \geq 1 - \xi_{ijk} \quad \forall (x_i, x_j, x_k) \in R,$$
where
$S = \{(x_i, x_j) : y_i = y_j \text{ and } x_j \text{ belongs to the } k\text{-neighborhood of } x_i\}$
$R = \{(x_i, x_j, x_k) : (x_i, x_j) \in S,\ y_i \neq y_k\}$
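
As a quick sanity check of how this objective is assembled, here is a small sketch (my own, not the authors' code) that evaluates the LMNN objective for a fixed M, assuming the target-neighbor pairs S and impostor triplets R are given as index lists; solving for M additionally requires projecting onto the PSD cone, which is omitted here.

```python
import numpy as np

def lmnn_objective(X, M, S_pairs, R_triplets, mu=0.5):
    """Value of the LMNN objective at M for target pairs S and triplets R (index lists)."""
    def d2(i, j):
        diff = X[i] - X[j]
        return diff @ M @ diff

    pull = sum(d2(i, j) for i, j in S_pairs)                 # attract target neighbors
    push = sum(max(0.0, 1.0 + d2(i, j) - d2(i, k))           # hinge slack for impostor triplets
               for i, j, k in R_triplets)
    return (1.0 - mu) * pull + mu * push
```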

Metric learning: large-scale challenges
How to efficiently learn multiple metrics? Multi-task learning, local linear metrics.
How to efficiently learn a metric for high-dimensional data? A full $d \times d$ matrix is not feasible anymore.
How to deal with large datasets? The number of terms in the objective can grow as $O(n^2)$ or $O(n^3)$.

Learning (Infinitely) Many Local Metrics [Shi et al., 2014]

Learning (infinitely) many local metrics: main idea
Assume we are given a basis dictionary $B = \{b_k\}_{k=1}^K$ in $\mathbb{R}^d$.
We will learn $d_w^2(x, x') = (x - x')^T M (x - x')$ where
$$M = \sum_{k=1}^K w_k \, b_k b_k^T, \quad w \geq 0.$$
Use a sparsity-inducing regularizer to do basis selection.
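
A minimal sketch of how such a metric is composed from the dictionary (illustration only; names are mine). With nonnegative weights, M is PSD by construction.

```python
import numpy as np

def compose_metric(basis, w):
    """M = sum_k w_k b_k b_k^T from a (K, d) basis dictionary and nonnegative weights w."""
    return (basis.T * w) @ basis

rng = np.random.default_rng(0)
K, d = 20, 5
basis = rng.standard_normal((K, d))
w = np.maximum(rng.standard_normal(K), 0.0)          # sparse nonnegative weights
M = compose_metric(basis, w)
print(np.all(np.linalg.eigvalsh(M) >= -1e-10))       # PSD check
```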

Learning (infinitely) many local metrics: global and multi-task convex formulations
Training data: triplet constraints $\mathcal{C} = \{(x_i, x_i^+, x_i^-)\}_{i=1}^C$, where $d_w^2(x_i, x_i^+)$ should be smaller than $d_w^2(x_i, x_i^-)$.

SCML-Global:
$$\min_{w \in \mathbb{R}_+^K} \ \frac{1}{C} \sum_{i=1}^C \left[ 1 + d_w^2(x_i, x_i^+) - d_w^2(x_i, x_i^-) \right]_+ + \beta \|w\|_1,$$
where $[\cdot]_+ = \max(0, \cdot)$.

mt-SCML: extension to $T$ tasks:
$$\min_{W \in \mathbb{R}_+^{T \times K}} \ \sum_{t=1}^T \frac{1}{C_t} \sum_{i=1}^{C_t} \left[ 1 + d_{w_t}^2(x_i, x_i^+) - d_{w_t}^2(x_i, x_i^-) \right]_+ + \beta \|W\|_{2,1}$$
$\|W\|_{2,1}$ induces column-wise sparsity: the metrics are task-specific but regularized to share features.

Learning (infinitely) many local metrics: local metric learning
Learn multiple local metrics for more flexibility: one metric per region or class, or one metric per training instance.
Many issues: How to define the local regions? How to generalize at test time? How to avoid a growing number of parameters? How to avoid overfitting?
Proposed approach: learn a metric for each point in space.

Learning (infinitely) many local metrics: local nonconvex formulation
We use the following parameterization:
$$d_w^2(x, x') = (x - x')^T \left( \sum_{k=1}^K w_k(x) \, b_k b_k^T \right) (x - x'),$$
$w_k(x) = (a_k^T \phi(x) + c_k)^2$: weight of basis $k$
$\phi(x) \in \mathbb{R}^{d'}$: (nonlinear) projection of $x$ (e.g., kernel PCA)
$a_k \in \mathbb{R}^{d'}$ and $c_k \in \mathbb{R}$: parameters to learn for each basis

SCML-Local:
$$\min_{\tilde{A} \in \mathbb{R}^{K \times (d'+1)}} \ \frac{1}{C} \sum_{i=1}^C \left[ 1 + d_w^2(x_i, x_i^+) - d_w^2(x_i, x_i^-) \right]_+ + \beta \|\tilde{A}\|_{2,1}$$
$\tilde{A}$: stack of $A = [a_1 \dots a_K]^T$ and $c$. If $A = 0$, we recover the global formulation.
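
A minimal sketch of this local parameterization (illustration only, assuming phi, A with rows a_k, c and the basis are given; all names are mine).

```python
import numpy as np

def local_sq_dist(x, x2, phi, A, c, basis):
    """d_w^2(x, x') with w_k(x) = (a_k^T phi(x) + c_k)^2 and M(x) = sum_k w_k(x) b_k b_k^T."""
    w = (A @ phi(x) + c) ** 2                # nonnegative, point-dependent weights
    M = (basis.T * w) @ basis                # sum_k w_k(x) b_k b_k^T
    diff = x - x2
    return float(diff @ M @ diff)

rng = np.random.default_rng(1)
K, d, dp = 10, 5, 3
basis = rng.standard_normal((K, d))
A, c = rng.standard_normal((K, dp)), rng.standard_normal(K)
phi = lambda x: x[:dp]                       # toy projection standing in for e.g. kernel PCA
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(local_sq_dist(x, xp, phi, A, c, basis))
```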

Learning (infinitely) many local metrics: efficient optimization
Nonsmooth objective with a large number of terms: use stochastic proximal methods.
SCML-Global and mt-SCML are convex: Regularized Dual Averaging [Xiao, 2010].
SCML-Local is nonconvex: forward-backward algorithm [Duchi and Singer, 2009], initialized with the solution of SCML-Global.
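
To give a feel for the kind of update involved, here is a generic stochastic proximal (forward-backward) step on the SCML-Global objective: a hinge subgradient on one sampled triplet, followed by the prox of the nonnegative L1 regularizer. This is only my illustration of the structure of such updates; the talk actually uses Regularized Dual Averaging for the global case and forward-backward for the local one.

```python
import numpy as np

def hinge_subgrad(w, x, xp, xn, basis):
    """Subgradient of [1 + d_w^2(x, x+) - d_w^2(x, x-)]_+ with respect to w."""
    gp = (basis @ (x - xp)) ** 2             # (b_k^T (x - x+))^2 for each basis k
    gn = (basis @ (x - xn)) ** 2             # (b_k^T (x - x-))^2 for each basis k
    return (gp - gn) if 1.0 + w @ gp - w @ gn > 0.0 else np.zeros_like(w)

def forward_backward_step(w, grad, step, beta):
    """Gradient step, then prox of beta*||w||_1 restricted to w >= 0 (soft-threshold and clip)."""
    return np.maximum(w - step * grad - step * beta, 0.0)
```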

Learning (infinitely) many local metrics: experiments

Dataset     Euc        LMNN       BoostML    SCML-Global
Vehicle     29.7±0.6   23.5±0.7   19.9±0.6   21.3±0.6
Vowel       11.1±0.4   10.8±0.4   11.4±0.4   10.9±0.5
Segment     5.2±0.2    4.6±0.2    3.8±0.2    4.1±0.2
Letters     14.0±0.2   11.6±0.3   10.8±0.2   9.0±0.2
USPS        10.3±0.2   4.1±0.1    7.1±0.2    4.1±0.1
BBC         8.8±0.3    4.0±0.2    9.3±0.3    3.9±0.2
Avg. rank   3.3        2.0        2.3        1.2

Dataset     M²LMNN     GLML       PLML       SCML-Local
Vehicle     23.1±0.6   23.4±0.6   22.8±0.7   18.0±0.6
Vowel       6.8±0.3    4.1±0.4    8.3±0.4    6.1±0.4
Segment     3.6±0.2    3.9±0.2    3.9±0.2    3.6±0.2
Letters     9.4±0.3    10.3±0.3   8.3±0.2    8.3±0.2
USPS        4.2±0.7    7.8±0.2    4.1±0.1    3.6±0.1
BBC         4.9±0.4    5.7±0.3    4.3±0.2    4.1±0.2
Avg. rank   2.0        2.7        2.0        1.2

Learning (infinitely) many local metrics: experiments
[Figures: (a) class membership, (b) trained metrics, (c) test metrics; plus plots of the number of selected bases and of the testing error rate as a function of the size of the basis set (50 to 2000), comparing SCML-Global, SCML-Local and Euc.]

Similarity Learning for High-Dimensional Sparse Data [Liu et al., 2015]

Similarity learning for high-dimensional sparse data: problem setting
Assume data points are high-dimensional ($d > 10^4$) but $D$-sparse (on average) with $D \ll d$: bags-of-words (text, image), bioinformatics, etc.
Existing metric learning algorithms fail: intractable (training cost $O(d^2)$ to $O(d^3)$, memory $O(d^2)$) and severe overfitting.
Practitioners use dimensionality reduction (PCA, RP): poor performance in the presence of noisy features, and the resulting metric is difficult to interpret for domain experts.
Contributions of this work: learn the similarity in the original high-dimensional space, with time/memory costs independent of $d$ and explicit control of the similarity complexity.

Similarity learning for high-dimensional sparse data: a very large basis set
We want to learn a similarity function $S_M(x, x') = x^T M x'$.
Given $\lambda > 0$, for any $i, j \in \{1, \dots, d\}$, $i \neq j$, we define $P_\lambda^{(ij)}$ and $N_\lambda^{(ij)}$ as the $d \times d$ matrices whose only nonzero entries are the $2 \times 2$ blocks on features $i, j$:
$$P_\lambda^{(ij)}\big|_{\{i,j\}} = \begin{pmatrix} \lambda & \lambda \\ \lambda & \lambda \end{pmatrix}, \qquad N_\lambda^{(ij)}\big|_{\{i,j\}} = \begin{pmatrix} \lambda & -\lambda \\ -\lambda & \lambda \end{pmatrix},$$
and set $B_\lambda = \{P_\lambda^{(ij)}, N_\lambda^{(ij)}\}_{i \neq j}$, $D_\lambda = \mathrm{conv}(B_\lambda)$.
One basis involves only 2 features:
$$S_{P_\lambda^{(ij)}}(x, x') = \lambda(x_i x_i' + x_j x_j' + x_i x_j' + x_j x_i'), \qquad S_{N_\lambda^{(ij)}}(x, x') = \lambda(x_i x_i' + x_j x_j' - x_i x_j' - x_j x_i').$$
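
A minimal sketch of how these two-feature bases act on a pair of points, and of a similarity built as a convex combination of such bases (illustration only; names are mine).

```python
import numpy as np

def basis_similarity(x, xp, i, j, lam, positive=True):
    """S for P_lambda^(ij) (positive=True) or N_lambda^(ij) (positive=False) on the pair (x, x')."""
    s = 1.0 if positive else -1.0
    return lam * (x[i] * xp[i] + x[j] * xp[j] + s * (x[i] * xp[j] + x[j] * xp[i]))

def sparse_similarity(x, xp, atoms):
    """atoms: list of (weight, i, j, lam, positive) with weights summing to 1 (a point of conv(B_lambda))."""
    return sum(w * basis_similarity(x, xp, i, j, lam, pos) for w, i, j, lam, pos in atoms)
```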

Similarity learning for high-dimensional sparse data: problem formulation and algorithm
Optimization problem (smoothed hinge loss $\ell$):
$$\min_{M \in \mathbb{R}^{d \times d}} \ f(M) = \frac{1}{C} \sum_{i=1}^C \ell\left(1 - x_i^T M x_i^+ + x_i^T M x_i^-\right) \quad \text{s.t. } M \in D_\lambda$$
Use a Frank-Wolfe algorithm [Jaggi, 2013] to solve it:
Let $M^{(0)} \in D_\lambda$
for $k = 0, 1, \dots$ do
  $B^{(k)} = \arg\min_{B \in B_\lambda} \langle B, \nabla f(M^{(k)}) \rangle$
  $M^{(k+1)} = (1 - \gamma) M^{(k)} + \gamma B^{(k)}$
end for
[Figure adapted from Jaggi, 2013.]
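
Below is a dense, brute-force sketch of this Frank-Wolfe loop (my own illustration, not the paper's implementation): it uses one common smoothed hinge, the standard step size gamma = 2/(k+2), starts from an arbitrary vertex of D_lambda, and materializes the gradient and M as dense matrices, which the actual method avoids by exploiting sparsity.

```python
import numpy as np

def smoothed_hinge_grad(u):
    """Derivative of one common smoothed hinge: 0 for u <= 0, u for 0 < u < 1, 1 for u >= 1."""
    return np.clip(u, 0.0, 1.0)

def frank_wolfe(X, Xp, Xn, lam=1.0, n_iter=50):
    """X, Xp, Xn: (C, d) arrays of anchors, similar and dissimilar examples."""
    C, d = X.shape
    M = np.zeros((d, d))
    M[0, 0] = M[1, 1] = M[0, 1] = M[1, 0] = lam      # start from the vertex P_lambda^{(0,1)}
    for k in range(n_iter):
        margins = 1.0 - np.sum(X * (Xp @ M.T), axis=1) + np.sum(X * (Xn @ M.T), axis=1)
        G = (X * smoothed_hinge_grad(margins)[:, None]).T @ (Xn - Xp) / C   # gradient of f at M
        # Linear minimization over B_lambda: score of P/N^(ij) is lam * (G_ii + G_jj +/- (G_ij + G_ji))
        diag = np.diag(G)
        base = diag[:, None] + diag[None, :]
        cross = G + G.T
        np.fill_diagonal(base, np.inf)               # exclude i == j
        sP, sN = lam * (base + cross), lam * (base - cross)
        use_P = sP.min() <= sN.min()
        i, j = np.unravel_index((sP if use_P else sN).argmin(), (d, d))
        B = np.zeros((d, d))
        s = 1.0 if use_P else -1.0
        B[i, i] = B[j, j] = lam
        B[i, j] = B[j, i] = s * lam
        gamma = 2.0 / (k + 2.0)                      # standard Frank-Wolfe step size
        M = (1.0 - gamma) * M + gamma * B
    return M
```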

Similarity learning for high-dimensional sparse data: convergence and complexity analysis
Convergence. Let $L = \frac{1}{C} \sum_{i=1}^C \|x_i (x_i^+ - x_i^-)^T\|_F^2$. At any iteration $k \geq 1$, the iterate $M^{(k)} \in D_\lambda$ of the FW algorithm:
has at most rank $k + 1$ with $4(k + 1)$ nonzero entries;
uses at most $2(k + 1)$ distinct features;
satisfies $f(M^{(k)}) - f(M^*) \leq 16 L \lambda^2 / (k + 2)$.
An optimal basis can be found in $O(CD^2)$ time and memory. An approximately optimal basis can be found in $O(mD^2)$ with $m \ll C$ using a Monte Carlo approximation of the gradient, or even in $O(mD)$ using a heuristic (good results in practice).
Storing $M^{(k)}$ requires only $O(k)$ memory, and even the entire sequence $M^{(0)}, \dots, M^{(k)}$ can be stored at the same cost.

Similarity learning for high-dimensional sparse data: experiments
k-NN test error on datasets with $d$ up to $10^5$:

Dataset    identity   rp+oasis   pca+oasis   diag-l2   diag-l1   hdsl
dexter     20.1       24.0       9.3         8.4       8.4       6.5
dorothea   9.3        11.4       9.9         6.8       6.6       6.5
rcv1_2     6.9        7.0        4.5         3.5       3.7       3.4
rcv1_4     11.2       10.6       6.1         6.2       7.2       5.7

Sparsity structure of the learned matrices: (a) dexter (20,000 × 20,000 matrix, 712 nonzeros), (b) rcv1_4 (29,992 × 29,992 matrix, 5263 nonzeros).

Similarity learning for high-dimensional sparse data: experiments
[Figures: number of selected features as a function of the iteration, for (a) the dexter dataset and (b) the rcv1_4 dataset.]

Similarity learning for high-dimensional sparse data: experiments
[Figures: error rate as a function of the projection space dimension (10^1 to 10^4) for (a) dexter, (b) dorothea, (c) rcv1_2 and (d) rcv1_4, comparing HDSL, RP, RP+OASIS, PCA, PCA+OASIS and Identity.]

Scaling-up ERM for Metric Learning [Clémençon et al., 2015]

Scaling-up ERM for metric learning: empirical minimization of U-statistics
Given an i.i.d. sample $X_1, \dots, X_n$, the U-statistic of degree 2 with kernel $H$ is given by
$$U_n(H; \theta) = \frac{2}{n(n-1)} \sum_{i < j} H(X_i, X_j; \theta).$$
It can be generalized to higher degrees and to multiple samples.
Empirical minimization of a U-statistic: $\min_{\theta \in \Theta} U_n(H; \theta)$. This applies to metric learning for classification, and also to pairwise clustering, ranking, etc.
The number of terms quickly becomes huge: on the MNIST dataset, $n = 60000$ gives roughly $2 \times 10^9$ pairs.
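
A minimal sketch of the complete degree-2 U-statistic for an arbitrary kernel H (illustration only); it makes the $O(n^2)$ cost explicit.

```python
import numpy as np
from itertools import combinations

def u_statistic(X, H):
    """U_n(H) = 2 / (n(n-1)) * sum_{i<j} H(X_i, X_j)."""
    n = len(X)
    return sum(H(X[i], X[j]) for i, j in combinations(range(n), 2)) * 2.0 / (n * (n - 1))
```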

Scaling-up ERM for metric learning: main idea and results
Approximate $U_n(H)$ by an incomplete version $\tilde{U}_B(H)$: uniformly sample $B$ terms (with replacement).
Main result: if $B = O(n)$, the learning rate is preserved ($O(1/\sqrt{n})$ convergence in the general case), while the objective function has only $O(n)$ terms vs. $O(n^2)$ initially. This holds despite the high dependence between the pairs.
Naive strategy: use the complete U-statistic with fewer samples. Uniformly sampling $O(\sqrt{n})$ samples and forming all possible pairs also gives $O(n)$ terms, but the learning rate is only $O(1/n^{1/4})$.
Similar ideas can be used in mini-batch SGD to reduce the variance of the gradient estimates.
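
And a sketch of the incomplete version, which averages over B uniformly sampled pairs instead of all of them (illustration only; the rare i == j draws are simply discarded here).

```python
import numpy as np

def incomplete_u_statistic(X, H, B, seed=0):
    """Average of H over B pairs (i, j), i != j, sampled uniformly with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.integers(0, n, size=(B, 2))
    idx = idx[idx[:, 0] != idx[:, 1]]        # drop the (rare) i == j draws
    return float(np.mean([H(X[i], X[j]) for i, j in idx]))
```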

Scaling-up ERM for metric learning: experiments
MNIST dataset: $n = 60000$, roughly $2 \times 10^9$ pairs. Mini-batch SGD, step size tuning, 50 random runs.

Conclusion and perspectives
Metric learning is useful in a variety of settings.
We presented ideas to overcome some key limitations: large datasets, high dimensionality, learning many local metrics.
Details and more results can be found in the papers.
One interesting challenge: distributed metric learning. Existing work [Xie and Xing, 2014] is too restrictive.

References I
[Clémençon et al., 2015] Clémençon, S., Bellet, A., and Colin, I. (2015). Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics. Technical report, arXiv:1501.02629.
[Duchi and Singer, 2009] Duchi, J. and Singer, Y. (2009). Efficient Online and Batch Learning Using Forward Backward Splitting. JMLR, 10:2899-2934.
[Jaggi, 2013] Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML.
[Liu et al., 2015] Liu, K., Bellet, A., and Sha, F. (2015). Similarity Learning for High-Dimensional Sparse Data. In AISTATS.
[Shi et al., 2014] Shi, Y., Bellet, A., and Sha, F. (2014). Sparse Compositional Metric Learning. In AAAI, pages 2078-2084.
[Weinberger et al., 2005] Weinberger, K. Q., Blitzer, J., and Saul, L. K. (2005). Distance Metric Learning for Large Margin Nearest Neighbor Classification. In NIPS, pages 1473-1480.

References II
[Xiao, 2010] Xiao, L. (2010). Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. JMLR, 11:2543-2596.
[Xie and Xing, 2014] Xie, P. and Xing, E. (2014). Large Scale Distributed Distance Metric Learning. Technical report, arXiv:1412.5949.