Regularized Distance Metric Learning: Theory and Algorithm




Rong Jin¹, Shijun Wang², Yang Zhou¹
¹ Dept. of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824
² Radiology and Imaging Sciences, National Institutes of Health, Bethesda, MD 20892
rongjin@cse.msu.edu, wangshi@cc.nih.gov, zhouyang@msu.edu

Abstract

In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning can be independent of the dimensionality, making it suitable for handling high-dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to state-of-the-art methods, and (ii) efficient and robust for high-dimensional data.

1 Introduction

Distance metric learning is a fundamental problem in machine learning and pattern recognition. It is critical to many real-world applications, such as information retrieval, classification, and clustering. Numerous algorithms have been proposed and examined for distance metric learning. They are usually classified into two categories: unsupervised metric learning and supervised metric learning. Unsupervised distance metric learning, sometimes referred to as manifold learning, aims to learn an underlying low-dimensional manifold in which the distances between most pairs of data points are preserved; example algorithms in this category include ISOMAP [10] and Local Linear Embedding (LLE) [6]. Supervised metric learning attempts to learn distance metrics from side information such as labeled instances and pairwise constraints. It searches for the optimal distance metric that (a) keeps data points of the same class close, and (b) keeps data points from different classes far apart. Example algorithms in this category include [13, 8, 12, 5, 11, 15, 4]. In this work, we focus on supervised distance metric learning.

Although a large number of studies have been devoted to supervised distance metric learning (see the survey [14] and references therein), few address its generalization error. In this paper, we examine the generalization error of regularized distance metric learning. Following the idea of stability analysis [1], we show that with appropriate constraints, the generalization error of regularized distance metric learning is independent of the dimensionality of the data, making it suitable for handling high-dimensional data. In addition, we present an online learning algorithm for regularized distance metric learning and derive its regret bound. Although online metric learning was studied in [7], our approach is advantageous in that (a) it is computationally more efficient in handling the SDP cone constraint, and (b) it has a proved regret bound, whereas [7] only shows a mistake bound for datasets that can be separated by a Mahalanobis distance.

To verify the efficacy and efficiency of the proposed algorithm for regularized distance metric learning, we conduct experiments with data classification and face recognition. Our empirical results show that the proposed online algorithm is (1) effective for metric learning compared to state-of-the-art methods, and (2) robust and efficient for high-dimensional data.

2 Regularized Distance Metric Learning

Let D = {z_i = (x_i, y_i), i = 1, ..., n} denote the labeled examples, where x_k = (x_k^1, ..., x_k^d) ∈ R^d is a d-dimensional vector and y_i ∈ {1, 2, ..., m} is a class label. In our study, we assume that the norm of any example is upper bounded by R, i.e., sup_x ||x||_2 ≤ R. Let A ∈ S_+^{d×d} be the distance metric to be learned, where the distance between two data points x and x' is calculated as ||x - x'||_A^2 = (x - x')^T A (x - x'). Following the idea of maximum margin classifiers, we have the following framework for regularized distance metric learning:

\min_A \; \frac{1}{2}\|A\|_F^2 + \frac{2C}{n(n-1)} \sum_{i<j} g\left( y_{i,j}\left[ 1 - \|x_i - x_j\|_A^2 \right] \right) \quad \text{s.t. } A \succeq 0,\; \mathrm{tr}(A) \le \eta(d) \qquad (1)

where y_{i,j} is derived from the class labels y_i and y_j, i.e., y_{i,j} = 1 if y_i = y_j and -1 otherwise. g(z) is the loss function: it outputs a small value when z is a large positive number, and a large value when z is a large negative number. We assume g(z) to be convex and Lipschitz continuous with Lipschitz constant L. ||A||_F^2 is the regularizer that measures the complexity of the distance metric A. The constraint tr(A) ≤ η(d) is introduced to ensure a bounded domain for A. As will be revealed later, this constraint becomes active only when the constraint constant η(d) is sublinear in d, i.e., η(d) ∈ O(d^p) with p < 1. We will also show how this constraint affects the generalization error of distance metric learning.
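To make the optimization in (1) concrete, here is a minimal NumPy sketch of the empirical objective. It assumes the hinge loss g(z) = max(0, b - z) that Section 4 later adopts; the function names and the default margin are ours, not the paper's, and the constraints A ⪰ 0, tr(A) ≤ η(d) are left to whatever solver is used.

```python
import numpy as np

def mahalanobis_sq(A, x, xp):
    """Squared distance ||x - x'||_A^2 = (x - x')^T A (x - x')."""
    diff = x - xp
    return diff @ A @ diff

def regularized_objective(A, X, y, C, b=1.0):
    """Objective of Eq. (1) under the hinge loss g(z) = max(0, b - z):
    0.5*||A||_F^2 + 2C/(n(n-1)) * sum_{i<j} g(y_ij * [1 - ||x_i - x_j||_A^2])."""
    n = X.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            y_ij = 1.0 if y[i] == y[j] else -1.0
            z = y_ij * (1.0 - mahalanobis_sq(A, X[i], X[j]))
            loss += max(0.0, b - z)
    return 0.5 * np.sum(A * A) + (2.0 * C / (n * (n - 1))) * loss
```

The O(n²) pair sum is written naively for readability; feasibility (A in the SDP cone, bounded trace) must be maintained separately.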

3.1 Geeralizatio Error Boud for Give Uiform Stability Aalysis i this sectio follows closely [1], ad we therefore omit the detailed proofs. Our aalysis utilizes the McDiarmid iequality that is stated as follows. Theorem 1. (McDiarmid Iequality) Give radom variables {v i } l i=1, v i, ad a fuctio F : vl R satisfyig sup F (v 1,..., v l ) F (v 1,..., v i 1, v i, v i+1,..., v l ) c i, v 1,...,v l,v i the followig statemet holds Pr ( F (v 1,..., v l ) E(F (v 1,..., v l )) > ɛ) 2 exp 2ɛ P l i=1 c2 i! To use the McDiarmid iequality, we first compute E(D D ). Lemma 1. Give a distace metric learig algorithm A has uiform stability κ/, we have the followig iequality for E(D D ) E(D D ) 2 κ (6) where is the umber of traiig examples i D. The result i the followig lemma shows that the coditio i McDiarmid iequality holds. Lemma 2. Let D be a collectio of radomly selected traiig examples, ad D i,z be the collectio of examples that replaces z i i D with example z. We have D D D D i,z bouded as follows D D D D i,z 2κ + 8Lη(d) + 2g 0 where g 0 = sup z,z V (0, z, z ) measures the largest loss whe distace metric A is 0. (7) Combiig the results i Lemma 1 ad 2, we ca ow derive the the boud for the geeralizatio error by usig the McDiarmid iequality. Theorem 2. Let D deote a collectio of radomly selected traiig examples, ad A D be the distace metric leared by the algorithm i (1) whose uiform stability is κ/. With probability 1 δ, we have the followig boud for I(A D ) I(A D ) I D (A D ) 2κ + (2κ + 4Lη(d) + 2g 0) l(2/δ) 2 (8) 3.2 Geeralizatio Error for Regularized Distace Metric Learig First, we show that the superium of tr(a D ) is O(d 1/2 ), which verifies that η(d) should behave subliear i d. This is summarized by the followig propositio. Propositio 1. The trace costrait i (1) will be activated oly whe where g 0 = sup z,z V (0, z, z ). η(d) 2dg 0 C (9) Proof. It follows directly from [tr(a D )/d] 2 A D 2 F 2C sup z,z V (0, z, z ) Cg 0. To boud the uiform stability, we eed the followig propositio Propositio 2. For ay two distace metrics A ad A, we have the followig iequality hold for ay examples z u ad z v V (A, z u, z v ) V (A, z u, z v ) 4LR 2 A A F (10) 3

The above proposition follows directly from the facts that (a) V(A, z, z') is Lipschitz continuous and (b) ||x||_2 ≤ R for any example x. The following lemma bounds ||A_D - A_{D^{i,z}}||_F.

Lemma 3. Let D denote a collection of n randomly selected training examples, and let z = (x, y) be a randomly selected example. Let A_D be the distance metric learned by the algorithm in (1). Then

\|A_D - A_{D^{i,z}}\|_F \le \frac{8CLR^2}{n} \qquad (11)

The proof of the above lemma can be found in Appendix A. Combining the results of Lemma 3 and Proposition 2, we have the following theorem for the stability of the Frobenius norm based regularizer.

Theorem 3. The uniform stability of the algorithm in (1) using the Frobenius norm regularizer, denoted by β, is bounded as follows:

\beta = \frac{\kappa}{n} \le \frac{32CL^2R^4}{n} \qquad (12)

where κ = 32CL²R⁴.

Combining Theorems 3 and 2, we have the following theorem for the generalization error of the distance metric learning algorithm in (1) using the Frobenius norm regularizer.

Theorem 4. Let D be a collection of n randomly selected examples, and let A_D be the distance metric learned from (1) using the Frobenius norm regularizer h(A) = ||A||_F^2. With probability 1 - δ, we have the following bound for the true loss I(A_D):

I(A_D) \le I_D(A_D) + \frac{32CL^2R^4}{n} + \left( 32CL^2R^4 + 4Ls(d) + 2g_0 \right) \sqrt{\frac{\ln(2/\delta)}{2n}} \qquad (13)

where s(d) = \min\left( \sqrt{2 d g_0 C},\; \eta(d) \right).

Remark. The most important feature of this estimation error bound is that it converges at the rate O(s(d)/\sqrt{n}). By choosing η(d) with a low dependence on d (i.e., η(d) ∼ d^p with p ≪ 1), the proposed framework for regularized distance metric learning is robust to high-dimensional data. In the extreme case, setting η(d) to a constant makes the estimation error independent of the dimensionality of the data.
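To see the remark numerically, the right-hand side of (13) can be evaluated directly. The sketch below is our own illustration (the constants fed in are made-up inputs, not values from the paper): with η(d) held constant the excess-risk term does not grow with d, while a dimension-proportional η(d) lets s(d) grow like √d.

```python
import numpy as np

def theorem4_excess(n, d, C, L, R, g0, eta_d, delta=0.05):
    """Right-hand side of Eq. (13) minus the empirical loss I_D(A_D):
    32CL^2R^4/n + (32CL^2R^4 + 4L*s(d) + 2g0) * sqrt(ln(2/delta)/(2n)),
    with s(d) = min(sqrt(2*d*g0*C), eta(d))."""
    kappa = 32.0 * C * L**2 * R**4
    s_d = min(np.sqrt(2.0 * d * g0 * C), eta_d)
    return kappa / n + (kappa + 4.0 * L * s_d + 2.0 * g0) * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

for d in (10, 10**3, 10**5):
    const_eta = theorem4_excess(n=10**4, d=d, C=1.0, L=1.0, R=1.0, g0=1.0, eta_d=5.0)
    linear_eta = theorem4_excess(n=10**4, d=d, C=1.0, L=1.0, R=1.0, g0=1.0, eta_d=float(d))
    print(f"d={d:>6}: constant eta -> {const_eta:.3f}, eta(d)=d -> {linear_eta:.3f}")
```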

4 Algorithm

In this section, we present an efficient algorithm for solving (1). We assume a hinge loss for g(z), i.e., g(z) = max(0, b - z), where b is the classification margin. To design an online learning algorithm for regularized distance metric learning, we follow the theory of gradient-based online learning [2] by defining the potential function Φ(A) = ||A||_F^2 / 2. Algorithm 1 shows the online learning algorithm, and the theorem below gives its regret bound.

Theorem 5. Let the online learning algorithm 1 run with learning rate λ > 0 on a sequence ((x_t, x'_t), y_t), t = 1, ..., n. Assume ||x||_2 ≤ R for all training examples. Then, for any distance metric M ∈ S_+^{d×d}, we have

\frac{1}{n} L_n \le \frac{1}{1 - 8R^4\lambda/b} \left( \frac{1}{n} L_n(M) + \frac{1}{2\lambda n} \|M\|_F^2 \right)

where L_n(M) = \sum_{t=1}^n \max\left( 0, b - y_t(1 - \|x_t - x'_t\|_M^2) \right) and L_n = \sum_{t=1}^n \max\left( 0, b - y_t(1 - \|x_t - x'_t\|_{A_{t-1}}^2) \right).

Algorithm 1 Online Learning Algorithm for Regularized Distance Metric Learning
1: INPUT: predefined learning rate λ
2: Initialize A_0 = 0
3: for t = 1, ..., T do
4:   Receive a pair of training examples {(x_t^1, y_t^1), (x_t^2, y_t^2)}
5:   Compute the class label y_t: y_t = +1 if y_t^1 = y_t^2, and y_t = -1 otherwise.
6:   if the training pair ((x_t^1, x_t^2), y_t) is classified correctly, i.e., y_t(1 - ||x_t^1 - x_t^2||_{A_{t-1}}^2) > 0, then
7:     A_t = A_{t-1}
8:   else
9:     A_t = π_{S_+}(A_{t-1} - λ y_t (x_t^1 - x_t^2)(x_t^1 - x_t^2)^T), where π_{S_+}(M) projects the matrix M onto the SDP cone.
10:  end if
11: end for

The proof of this theorem can be found in Appendix B. Note that the above online learning algorithm requires computing π_{S_+}(M), i.e., projecting the matrix M onto the SDP cone, which is expensive for high-dimensional data. To address this challenge, first notice that M' = π_{S_+}(M) is equivalent to the optimization problem M' = \arg\min_{M' \succeq 0} \|M' - M\|_F. We thus approximate A_t = π_{S_+}(A_{t-1} - λ y_t (x_t - x'_t)(x_t - x'_t)^T) by A_t = A_{t-1} - λ_t y_t (x_t - x'_t)(x_t - x'_t)^T, where λ_t is computed as follows:

\lambda_t = \arg\min_{\lambda_t} \left\{ |\lambda_t - \lambda| : \lambda_t \in [0, \lambda],\; A_{t-1} - \lambda_t y_t (x_t - x'_t)(x_t - x'_t)^T \succeq 0 \right\} \qquad (14)

The following theorem gives the solution to the above optimization problem.

Theorem 6. The optimal solution λ_t to the problem in (14) is

\lambda_t = \begin{cases} \lambda & y_t = -1 \\ \min\left( \lambda,\; \left[ (x_t - x'_t)^T A_{t-1}^{-1} (x_t - x'_t) \right]^{-1} \right) & y_t = +1 \end{cases}

The proof of this theorem can be found in the supplementary materials. Finally, the quantity (x_t - x'_t)^T A_{t-1}^{-1} (x_t - x'_t) can be obtained by solving the optimization problem

\max_u \; 2u^T(x_t - x'_t) - u^T A_{t-1} u

whose optimal value can be computed efficiently using the conjugate gradient method [9]. Note that compared to the online metric learning algorithm in [7], the proposed online learning algorithm is advantageous in that (i) it is computationally more efficient, avoiding the projection of a matrix onto the SDP cone, and (ii) it has a provable regret bound, while [7] only presents a mistake bound for separable datasets.
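Below is a self-contained NumPy sketch of Algorithm 1 combined with the truncated step of Theorem 6. It is our reading of the method, not the authors' code: the plain conjugate-gradient routine stands in for the solver of [9], and we start from A_0 = εI rather than A_0 = 0 (our assumption, so that A_{t-1}^{-1} is well defined whenever a y_t = +1 update occurs).

```python
import numpy as np

def cg_solve(A, v, tol=1e-8, max_iter=200):
    """Conjugate gradient for A u = v, A symmetric positive definite (cf. [9]).
    The solution satisfies v^T u = v^T A^{-1} v, the quantity needed in Theorem 6."""
    u = np.zeros_like(v)
    r = v - A @ u
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        u += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u

def online_reg(pairs, labels, lam, eps=1e-3):
    """Sketch of Algorithm 1 with the approximate projection of Theorem 6.
    pairs: sequence of (x1, x2) arrays; labels: +1 (same class) or -1.
    A_0 = eps*I instead of 0 is our assumption, keeping A_{t-1} invertible."""
    d = pairs[0][0].shape[0]
    A = eps * np.eye(d)
    for (x1, x2), y_t in zip(pairs, labels):
        v = x1 - x2
        if y_t * (1.0 - v @ A @ v) > 0:        # pair classified correctly: no update
            continue
        if y_t == -1:
            lam_t = lam                         # A + lam * v v^T is always PSD
        else:
            u = cg_solve(A, v)                  # u = A^{-1} v
            lam_t = min(lam, 1.0 / max(v @ u, 1e-12))
        A = A - lam_t * y_t * np.outer(v, v)    # truncated step keeps A in the SDP cone
    return A
```

Each mistake costs one rank-one update plus a CG solve, i.e., O(d²) work, versus the O(d³) eigendecomposition that a full projection π_{S_+} would require.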

5 Experiments

We conducted an extensive study to verify both the efficiency and the efficacy of the proposed algorithms for metric learning. For convenience of discussion, we refer to the proposed online distance metric learning algorithm as online-reg. To examine the efficacy of the learned distance metric, we employ the k-Nearest-Neighbor (k-NN) classifier; our hypothesis is that the better the distance metric, the higher the classification accuracy of k-NN. We set k = 3 in all experiments, based on our experience. We compare our algorithm to the following six state-of-the-art distance metric learning algorithms as baselines: (1) the Euclidean distance metric; (2) the Mahalanobis distance metric, computed as the inverse of the covariance matrix of the training samples, i.e., (\sum_{i=1}^n x_i x_i^T)^{-1}; (3) Xing's algorithm [13]; (4) LMNN, a distance metric learning algorithm based on the large margin nearest neighbor classifier [12]; (5) ITML, information-theoretic metric learning [4]; and (6) Relevance Component Analysis (RCA) [8]. We set the maximum number of iterations for Xing's method to 10,000. The number of target neighbors in LMNN and the parameter γ in ITML were tuned by cross validation over the range 10^{-4} to 10^4. All algorithms are implemented in Matlab and run on an AMD 2.8GHz machine with 8GB RAM under Linux.

5.1 Experiment (I): Comparison to State-of-the-art Algorithms

We conducted data classification experiments on the following nine datasets from the UCI repository: (1) balance-scale, 3 classes, 4 features, 625 instances; (2) breast-cancer, 2 classes, 10 features, 683 instances; (3) glass, 6 classes, 9 features, 214 instances; (4) iris, 3 classes, 4 features, 150 instances; (5) pima, 2 classes, 8 features, 768 instances; (6) segmentation, 7 classes, 19 features, 210 instances; (7) wine, 3 classes, 13 features, 178 instances; (8) waveform, 3 classes, 21 features, 5000 instances; (9) optdigits, 10 classes, 64 features, 3823 instances. For each dataset, we randomly select 50% of the samples for training and use the remaining samples for testing.

Table 1: Classification error (%) of a k-NN (k = 3) classifier on the nine UCI datasets using seven different metrics, with standard deviations.

Dataset  Euclidean    Mahala       Xing          LMNN         ITML         RCA          Online-reg
1        19.5 ± 2.2   18.8 ± 2.5   29.3 ± 17.2   13.8 ± 2.5   8.6 ± 1.7    17.4 ± 1.5   13.2 ± 2.2
2        39.9 ± 2.3   6.7 ± 0.6    40.1 ± 2.6    3.6 ± 1.1    40.0 ± 2.3   3.8 ± 0.4    3.7 ± 1.2
3        36.0 ± 2.0   42.1 ± 4.0   43.5 ± 12.5   33.1 ± 0.6   39.8 ± 3.3   41.6 ± 0.7   37.3 ± 4.1
4        4.0 ± 1.7    10.4 ± 2.7   3.1 ± 2.0     3.9 ± 1.6    3.2 ± 1.6    2.9 ± 1.5    3.2 ± 1.3
5        30.6 ± 1.9   29.1 ± 2.1   30.6 ± 1.9    29.6 ± 1.8   28.8 ± 2.1   28.6 ± 2.3   27.7 ± 1.3
6        25.4 ± 4.2   18.4 ± 3.4   23.3 ± 3.4    15.2 ± 3.1   17.1 ± 4.1   13.9 ± 2.2   12.9 ± 2.2
7        31.9 ± 2.8   10.0 ± 2.8   24.6 ± 7.5    4.5 ± 2.4    28.7 ± 3.7   1.8 ± 1.5    1.8 ± 1.1
8        18.9 ± 0.5   37.3 ± 0.5   16.1 ± 0.6    18.4 ± 0.4   23.3 ± 1.3   30.6 ± 0.7   19.8 ± 0.6
9        2.0 ± 0.4    6.1 ± 0.5    12.4 ± 0.8    1.6 ± 0.3    2.5 ± 0.4    2.8 ± 0.4    2.9 ± 0.4

Table 1 shows the classification errors of all metric learning methods on the nine datasets, averaged over 10 runs, together with standard deviations. We observe that the proposed metric learning algorithm delivers performance comparable to the state-of-the-art methods. In particular, for almost all datasets, the classification accuracy of the proposed algorithm is close to that of LMNN, which yields the best overall performance among the six baselines. This is consistent with the results of other studies, which show that LMNN is among the most effective algorithms for distance metric learning.

To further verify whether the proposed method performs statistically better than the baselines, we conduct a Wilcoxon signed-rank test [3]. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test for comparing two related samples; it is safer than the Student's t-test because it does not assume normal distributions.

Table 2: p-values of the Wilcoxon signed-rank test for the 7 methods on the 9 datasets.

Methods     Euclidean  Mahala  Xing   LMNN   ITML   RCA    Online-reg
Euclidean   1.000      0.734   0.641  0.004  0.496  0.301  0.129
Mahala      0.734      1.000   0.301  0.008  0.570  0.004  0.004
Xing        0.641      0.301   1.000  0.027  0.359  0.074  0.027
LMNN        0.004      0.008   0.027  1.000  0.129  0.496  0.734
ITML        0.496      0.570   0.359  0.129  1.000  0.820  0.164
RCA         0.301      0.004   0.074  0.496  0.820  1.000  0.074
Online-reg  0.129      0.004   0.027  0.734  0.164  0.074  1.000

From Table 2, we find that regularized distance metric learning improves the classification accuracy significantly compared to the Mahalanobis distance, Xing's method, and RCA at significance level 0.1. It performs slightly better than ITML and is comparable to LMNN.
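For reference, the evaluation protocol above reduces to 3-NN classification under the learned metric: factoring A = L^T L lets ordinary Euclidean k-NN run on the transformed points Lx. The sketch below is ours (naive O(n²) neighbor search, integer class labels assumed); scipy.stats.wilcoxon provides the signed-rank test reported in Table 2.

```python
import numpy as np
from scipy.stats import wilcoxon

def knn_error(A, X_tr, y_tr, X_te, y_te, k=3):
    """Classification error of k-NN under metric A, via A = L^T L and x -> L x."""
    w, V = np.linalg.eigh(A)                         # A is symmetric PSD
    L = np.sqrt(np.clip(w, 0.0, None))[:, None] * V.T
    Z_tr, Z_te = X_tr @ L.T, X_te @ L.T
    errors = 0
    for z, y in zip(Z_te, y_te):
        idx = np.argsort(np.sum((Z_tr - z) ** 2, axis=1))[:k]
        pred = np.bincount(y_tr[idx].astype(int)).argmax()  # majority vote
        errors += int(pred != y)
    return errors / len(y_te)

# Table 2 style paired comparison of two methods' per-dataset errors:
# stat, p = wilcoxon(errors_online_reg, errors_lmnn)
```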

[Figure 1: (a) Face recognition accuracy of kNN and (b) running time (in seconds) of the LMNN, ITML, RCA, and online-reg algorithms on the AT&T face dataset, both plotted against the image resize ratio (0.1 to 0.2).]

5.2 Experiment (II): Results for High-Dimensional Data

To evaluate the dependence of the regularized metric learning algorithm on data dimensionality, we test it on the task of face recognition. The AT&T face database¹ is used in our study. It consists of grey-scale face images of 40 distinct subjects, with ten pictures per subject. For every subject, the images were taken at different times, with varying lighting conditions, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). The original size of each image is 112 × 92 pixels, with 256 grey levels per pixel. To examine the sensitivity to data dimensionality, we vary the data dimension (i.e., the size of the images) by compressing the original images to different sizes with the aspect ratio preserved. The compression is achieved by bicubic interpolation (each output pixel value is a weighted average of the pixels in the nearest 4-by-4 neighborhood). For each subject, we randomly split its face images into a training set and a test set with ratio 4:6. A distance metric is learned from the collection of training face images and used by the kNN classifier (k = 3) to predict the subject ID of the test images. We repeat each experiment 10 times and report the classification accuracy averaged over the 40 subjects and 10 runs.

Figure 1(a) shows the average classification accuracy of the kNN classifier using the different distance metric learning algorithms, and Figure 1(b) shows their running times on the same dataset. Note that Xing's method is excluded from this comparison because of its extremely long computation time. We observe that with increasing image size (dimensionality), the regularized distance metric learning algorithm yields stable performance, indicating that it is resilient to high-dimensional data. In contrast, for almost all the baseline methods except ITML, performance varies significantly as the size of the input image changes. Although ITML yields stable performance with respect to image size, its high computational cost (Figure 1(b)), arising from solving a Bregman optimization problem in each iteration, makes it unsuitable for high-dimensional data.

¹ http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
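A rough sketch of the dimensionality sweep described above, under our own assumptions: scipy.ndimage.zoom with cubic spline interpolation stands in for the bicubic resampling, and the images of one subject are taken to be a list of 112 × 92 arrays.

```python
import numpy as np
from scipy.ndimage import zoom

def resized_features(images, ratio):
    """Shrink each face image by `ratio` (cubic interpolation, order=3)
    and flatten it into a feature vector; stacking gives the data matrix."""
    return np.stack([zoom(img, ratio, order=3).ravel() for img in images])

def split_4_6(n_images, rng):
    """Random 4:6 train/test split of one subject's images (n_images = 10)."""
    idx = rng.permutation(n_images)
    return idx[:4], idx[4:]

# e.g. rng = np.random.default_rng(0); train_idx, test_idx = split_4_6(10, rng)
```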

6 Conclusion

In this paper, we analyze the generalization error of regularized distance metric learning. We show that with an appropriate constraint, regularized distance metric learning can be robust to high-dimensional data. We also present efficient learning algorithms for solving the related optimization problems. Empirical studies with face recognition and data classification show that the proposed approach is (i) robust and efficient for high-dimensional data, and (ii) comparable to the state-of-the-art approaches for distance metric learning. In the future, we plan to investigate different regularizers and their effect on distance metric learning.

ACKNOWLEDGEMENTS

This work was supported in part by the National Science Foundation (IIS-0643494) and by the U.S. Army Research Laboratory and the U.S. Army Research Office (W911NF-09-1-0421). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or ARO.

Appendix A: Proof of Lemma 3

Proof. We introduce the Bregman divergence for the proof of this lemma. Given a convex function ϕ(X) of a matrix X, the Bregman divergence between two matrices A and B is computed as

d_\phi(A, B) = \phi(B) - \phi(A) - \mathrm{tr}\left( \nabla\phi(A)^T (B - A) \right)

We define the convex functions N(X) and V_D(X) as

N(X) = \|X\|_F^2, \qquad V_D(X) = \frac{2}{n(n-1)} \sum_{i<j} V(X, z_i, z_j)

and furthermore the convex function T_D(X) = N(X) + C V_D(X). We thus have

d_N(A_D, A_{D^{i,z}}) + d_N(A_{D^{i,z}}, A_D)
\le d_{T_D}(A_D, A_{D^{i,z}}) + d_{T_{D^{i,z}}}(A_{D^{i,z}}, A_D)
= \frac{2C}{n(n-1)} \sum_{j \ne i} \left[ V(A_{D^{i,z}}, z_i, z_j) - V(A_{D^{i,z}}, z, z_j) + V(A_D, z, z_j) - V(A_D, z_i, z_j) \right]
\le \frac{16CLR^2}{n} \|A_D - A_{D^{i,z}}\|_F

The first inequality follows from the fact that both N(X) and V_D(X) are convex in X. The second step holds because the matrices A_D and A_{D^{i,z}} minimize the objective functions T_D(X) and T_{D^{i,z}}(X), respectively, and therefore

\mathrm{tr}\left( (A_{D^{i,z}} - A_D)^T \nabla T_D(A_D) \right) \ge 0, \qquad \mathrm{tr}\left( (A_D - A_{D^{i,z}})^T \nabla T_{D^{i,z}}(A_{D^{i,z}}) \right) \ge 0

The last inequality follows from Proposition 2. Since d_N(A, B) = ||A - B||_F^2, we therefore have 2||A_D - A_{D^{i,z}}||_F^2 ≤ (16CLR²/n)||A_D - A_{D^{i,z}}||_F, which leads to the result in the lemma.

Appendix B: Proof of Theorem 5

Proof. Denote \hat{A}_t = A_{t-1} - \lambda y_t (x_t - x'_t)(x_t - x'_t)^T and A_t = \pi_{S_+}(\hat{A}_t). Following Theorems 11.1 and 11.4 of [2], we have

L_n - L_n(M) \le \frac{1}{\lambda} D_\Phi(M, A_0) + \frac{1}{\lambda} \sum_{t=1}^n D_\Phi(A_{t-1}, \hat{A}_t)

where D_\Phi(A, B) = \frac{1}{2}\|A - B\|_F^2, since \Phi(A) = \Phi^*(A) = \frac{1}{2}\|A\|_F^2.

Using the relation \hat{A}_t = A_{t-1} - \lambda y_t (x_t - x'_t)(x_t - x'_t)^T and A_0 = 0, we have

L_n - L_n(M) \le \frac{1}{2\lambda}\|M\|_F^2 + \frac{\lambda}{2} \sum_{t=1}^n I\left[ y_t(1 - \|x_t - x'_t\|_{A_{t-1}}^2) < 0 \right] \|x_t - x'_t\|^4

By the assumption ||x||_2 ≤ R for any training example, we have ||x_t - x'_t||^4 ≤ 16R^4. Since each mistake incurs a hinge loss of at least b,

\sum_{t=1}^n I\left[ y_t(1 - \|x_t - x'_t\|_{A_{t-1}}^2) < 0 \right] \|x_t - x'_t\|^4 \le \frac{16R^4}{b} \sum_{t=1}^n \max\left( 0, b - y_t(1 - \|x_t - x'_t\|_{A_{t-1}}^2) \right) = \frac{16R^4}{b} L_n

we thus have L_n - L_n(M) \le \frac{1}{2\lambda}\|M\|_F^2 + \frac{8R^4\lambda}{b} L_n, and rearranging yields the result in the theorem.

References

[1] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, March 2002.
[2] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[3] G.W. Corder and D.I. Foreman. Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Wiley, New Jersey, 2009.
[4] J.V. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, 2005.
[6] L.K. Saul and S.T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 2003.
[7] S. Shalev-Shwartz, Y. Singer, and A.Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 94-101, 2004.
[8] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings of the Seventh European Conference on Computer Vision, volume 4, pages 776-792, 2002.
[9] J.R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Carnegie Mellon University, Pittsburgh, PA, USA, 1994.
[10] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2000.
[11] I.W. Tsang, P.M. Cheung, and J.T. Kwok. Kernel relevant component analysis for distance metric learning. In IEEE International Joint Conference on Neural Networks (IJCNN), 2005.
[12] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, 2005.
[13] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems, 2002.
[14] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Technical report, Michigan State University, 2006.
[15] L. Yang, R. Jin, R. Sukthankar, and Y. Liu. An efficient algorithm for local distance metric learning. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), 2006.