Cross-Domain Metric Learning Based on Information Theory


Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence

Hao Wang 1,2, Wei Wang 2,3, Chen Zhang 2, Fanjiang Xu 2
1. State Key Laboratory of Computer Science
2. Science and Technology on Integrated Information System Laboratory
Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
3. Department of Automation, University of Science and Technology of China

Abstract

Supervised metric learning plays a substantial role in statistical classification. Conventional metric learning algorithms have limited utility when the training data and testing data are drawn from related but different domains (i.e., source domain and target domain). Although feature-based transfer learning has made some progress on this issue, most of the work in this area suffers from non-trivial optimization and pays little attention to preserving the discriminating information. In this paper, we propose a novel metric learning algorithm to transfer knowledge from the source domain to the target domain in an information-theoretic setting, where a shared Mahalanobis distance across two domains is learnt by combining three goals: 1) reducing the distribution difference between different domains; 2) preserving the geometry of target domain data; 3) aligning the geometry of source domain data with its label information. Based on this combination, the learnt Mahalanobis distance effectively transfers the discriminating power and propagates standard classifiers across these two domains. More importantly, our proposed method has a closed-form solution and can be efficiently optimized. Experiments in two real-world applications demonstrate the effectiveness of our proposed method.

Introduction

Distance metric learning is of fundamental importance in machine learning. Previous research has demonstrated that appropriate distance metrics learnt from labeled training data can greatly improve classification accuracy (Jin, Wang, and Zhou 2009).
Depending on whether the geometry information is used, state-of-the-art supervised metric learning methods can be classified into two categories, i.e., globality and locality. Globality metric learning methods aim at keeping all the data points in the same class close together for compactness while keeping those from different classes far apart for separability (Davis et al. 2007; Globerson and Roweis 2006; Wang and Jin 2009; Xing et al. 2002). Locality metric learning methods incorporate the geometry of data with the label information to accommodate multimodal data distributions and to further improve classification performance (Weinberger and Saul 2009; Yang et al. 2006).

Corresponding author who made main idea and contribution to this work.
Copyright © 2014, Association for the Advancement of Artificial Intelligence. All rights reserved.

Existing metric learning methods perform well when there are sufficient labeled training samples. However, in some real-world applications, obtaining the label information of data points drawn from the task-specific domain (i.e., target domain) is extremely expensive or even impossible. One may instead look for labeled data drawn from a related but different domain (i.e., source domain) and apply it as prior knowledge. However, distance metrics learnt only in the source domain cannot be directly reused in the target domain, even though the two domains are closely related. This is because the significant distribution difference between the data drawn from the source and target domains is not explicitly taken into consideration, and this difference makes classifiers trained in the source domain invalid in the target domain. Therefore, it is important and necessary to reduce the distribution difference between labeled source domain data and unlabeled target domain data in distance metric learning.
Recently, some feature extraction approaches in transfer learning (Caruana 1997; Pan and Yang 2010) have been proposed to address this problem by implicitly exploring a metric (similarity) as a bridge for information transfer from the source domain to the target domain (Geng, Tao, and Xu 2011; Long et al. 2013; Pan, Kwok, and Yang 2008; Pan et al. 2011; Si, Tao, and Geng 2010). These feature extraction methods learn a shared feature representation across domains by 1) reducing the distribution difference, and 2) preserving the important properties (e.g., variance or geometry) of data, especially the target domain data. However, most work in this area does not focus on incorporating the geometry with the label information of source domain data to improve the classification performance in the target domain. Moreover, these methods formulate a semidefinite programming (SDP) problem (Boyd and Vandenberghe 2004) or a non-convex optimization problem, resulting in expensive computation.

In this paper, we address the transfer learning problem from the metric learning view and propose a novel algorithm named Cross-Domain Metric Learning (CDML). Specifically, CDML first minimizes the distance between different distributions such that the marginal distributions of target domain and source domain data are close under the learnt distance metric. Second, two Gaussian distributions are constructed, one based on the Mahalanobis distance to be learnt and the other based on the geometry of target domain data. By minimizing the relative entropy between these two distributions, the geometry of target domain data is preserved in the learnt distance metric. Third, another two Gaussian distributions are constructed, one again based on the Mahalanobis distance to be learnt and the other based on the labels and the geometry of source domain data. By minimizing the relative entropy between these two distributions, the learnt distance metric pulls the source domain data in the same class close together, while pushing differently labeled data far apart. Finally, the three terms above are combined into the unified loss function of CDML. This combination effectively transfers the discriminating power gained from the labeled source domain data to the unlabeled target domain data. To the best of our knowledge, our method is the first attempt at cross-domain metric learning based on relative entropy. We emphasize that CDML has a closed-form solution, leading to efficient optimization.

In summary, the contribution of this paper is two-fold. From the perspective of metric learning, we aim at addressing the challenge of distribution difference. From the perspective of transfer learning, a novel algorithm is proposed to transfer knowledge by finding a shared Mahalanobis distance across domains. The optimal metric can be found efficiently in closed form. Under this optimal metric, the data distributions are close and points from different classes can be well separated. As a result, we can train standard classifiers in the source domain and reuse them to correctly classify the target domain data. Experimental results in real-world applications verify the effectiveness and efficiency of CDML compared with state-of-the-art metric learning methods and transfer learning methods.
Related Work

Metric Learning

Significant efforts in metric learning have been spent on learning a Mahalanobis distance from labeled training data for classification. Existing Mahalanobis distance learning methods can be classified into two categories, i.e., globality and locality. A natural approach in globality learning is to formulate an SDP that keeps points with the same label similar (i.e., the distances between them should be small) and differently labeled points dissimilar (i.e., the distances should be larger) (Globerson and Roweis 2006; Xing et al. 2002). Other notable work in globality learning is based on information theory (Davis et al. 2007; Wang and Jin 2009). In particular, Information-Theoretic Metric Learning (ITML) (Davis et al. 2007) formulates the relative entropy as a Bregman optimization problem subject to linear constraints. Information Geometry Metric Learning (IGML) (Wang and Jin 2009) minimizes the Kullback-Leibler (K-L) divergence between two Gaussian distributions and finds the closed-form solution. Locality metric learning methods maximally align the geometry of data with its label information (Weinberger and Saul 2009; Yang et al. 2006) to further improve their performance. However, the supervised algorithms discussed above are limited by the underlying assumption that training data and testing data are drawn from the same distribution.

Transfer Learning

State-of-the-art transfer learning can be organized into instance reweighting (Dai et al. 2007a) and feature extraction. In the feature extraction category, recent work tries to find a subspace shared by both domains, such that the distribution difference is explicitly reduced and the important properties of the original data are preserved (Geng, Tao, and Xu 2011; Long et al. 2013; Pan, Kwok, and Yang 2008; Si, Tao, and Geng 2010). In this subspace, classifiers can be propagated between domains. Specifically, Maximum Mean Discrepancy Embedding (MMDE) (Pan, Kwok, and Yang 2008) employs Maximum Mean Discrepancy (MMD) (Gretton et al.
2006) to estimate the distance between different distributions and learns a kernel matrix while preserving the data variance at the same time. Joint Distribution Adaptation (JDA) (Long et al. 2013) extends MMD and constructs a feature subspace by Principal Component Analysis (PCA) (Jolliffe 1986). Transfer Subspace Learning (TSL) (Si, Tao, and Geng 2010) integrates the Bregman divergence with some dimension reduction algorithms, e.g., PCA and Fisher's linear discriminant analysis (FLDA) (Fisher 1936). However, these methods formulate an SDP or a non-convex optimization, which has high computational complexity and requires iteratively updating parameters. Even worse, the non-convex problems are prone to being trapped in local solutions. In comparison, our metric learning method has an efficient closed-form solution and optimally transfers the discriminating power. We would also like to mention that Transfer Component Analysis (TCA) (Pan et al. 2011) is an efficient kernel learning method that extends MMDE. Our work differs from TCA significantly in the proposed optimization. In this paper, an optimal Mahalanobis distance is searched by utilizing the relationship between Gaussian distributions.

Cross-Domain Metric Learning Based on Information Theory

In this section, we present the proposed algorithm named Cross-Domain Metric Learning (CDML) in detail.

Problem Definition

We begin with the problem definition. Table 1 lists the important notations used in this paper.

Definition 1. (The Mahalanobis Distance) Denote $x_i, x_j \in \mathbb{R}^d$; then the Mahalanobis distance between $x_i$ and $x_j$ is calculated as follows:

$$d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j), \qquad (1)$$

where $A \in \mathbb{R}^{d \times d}$ is positive semi-definite.

In fact, there is a close link between the Mahalanobis distance and linear transformation. If we define a linear projection $W$ with $W^T W = A$ which maps $x_i$ to $W x_i$, the (squared) Euclidean distance between $W x_1$ and $W x_2$, i.e., $\|W x_1 - W x_2\|_2^2 = (x_1 - x_2)^T W^T W (x_1 - x_2) = (x_1 - x_2)^T A (x_1 - x_2)$, is actually the Mahalanobis distance between $x_1$ and $x_2$.
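The link between Definition 1 and a linear projection can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the dimensions, the random projection `W`, and the helper name `mahalanobis_sq` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 5, 3

# Hypothetical projection W (r x d); A = W^T W is positive semi-definite by construction.
W = rng.normal(size=(r, d))
A = W.T @ W

def mahalanobis_sq(x_i, x_j, A):
    """d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j), as in Equation (1)."""
    diff = x_i - x_j
    return float(diff @ A @ diff)

x1, x2 = rng.normal(size=d), rng.normal(size=d)

d_A = mahalanobis_sq(x1, x2, A)               # distance under A
d_W = float(np.sum((W @ x1 - W @ x2) ** 2))   # squared Euclidean distance after projecting
```

Under these assumptions `d_A` and `d_W` agree up to floating-point error, which is why learning `A` and learning `W` can be treated interchangeably in the rest of the section.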

Table 1: List of important notations used in this paper.

Notation | Description
$X_{src} = \{(x_1^s, y_1^s), \ldots, (x_n^s, y_n^s)\}$ | Source domain data set
$X_{tar} = \{x_1^t, \ldots, x_m^t\}$ | Target domain data set
$X = \{x_1^s, \ldots, x_n^s, x_1^t, \ldots, x_m^t\}$ | Input data set
$W$ | Linear transformation matrix
$A = W^T W$ | Mahalanobis distance matrix
$L$ | The MMD matrix
$K_{tar} = [\langle W x_i^t, W x_j^t \rangle]_{m \times m}$ | The linear kernel matrix for $W X_{tar}$
$K_T$ | The ideal kernel matrix for $X_{tar}$
$K_{src} = [\langle W x_i^s, W x_j^s \rangle]_{n \times n}$ | The linear kernel matrix for $W X_{src}$
$K_S$ | The ideal kernel matrix for $X_{src}$

Problem 1. (Cross-Domain Metric Learning Based on Information Theory) Let $X_{tar}$ be a set of $m$ unlabeled testing samples drawn from a target domain: $X_{tar} = \{x_1^t, \ldots, x_m^t\}$, where $x_i^t \in \mathbb{R}^d$. Let $X_{src}$ be a set of $n$ labeled training samples drawn from a related source domain: $X_{src} = \{(x_1^s, y_1^s), \ldots, (x_n^s, y_n^s)\}$, where $x_i^s \in \mathbb{R}^d$ and $y_i^s \in Y^s$ is the class label. We denote $P_t(X_{tar})$ and $P_s(X_{src})$ as the marginal probability distributions of $X_{tar}$ and $X_{src}$ respectively, with $P_t(X_{tar}) \neq P_s(X_{src})$. Our task is to learn a shared distance metric $A$ across domains under which 1) the distribution difference between $P_s(X_{src})$ and $P_t(X_{tar})$ is explicitly reduced; 2) the geometry of $X_{tar}$ is preserved; 3) the points from $X_{src}$ with the same label are kept similar according to the geometry and others are kept dissimilar.

Minimizing Distribution Difference

Conventional Mahalanobis distance learning methods perform well in the classification setting based on the assumption that training and testing points are drawn from the same distribution (i.e., $P_s(X_{src}) = P_t(X_{tar})$). When such a distance metric $W_c$ is learnt from $X_{src}$, it can improve classification accuracy on $X_{tar}$ using standard classifiers such as KNN and SVM. However, $P_s(X_{src})$ is usually different from $P_t(X_{tar})$ since $X_{src}$ and $X_{tar}$ are drawn from different but related domains.
In this case, $P_s(W_c X_{src})$ and $P_t(W_c X_{tar})$ are still significantly different, and standard classification models trained on $W_c X_{src}$ cannot be directly applied to $W_c X_{tar}$. Therefore, it is necessary to find a metric $W$ which can reduce the distance between different distributions. This issue is of particular importance and has become popular in transfer learning. Inspired by the work of (Long et al. 2013; Pan, Kwok, and Yang 2008), we adopt the Maximum Mean Discrepancy (MMD) criterion to measure the distance between $P_s(W X_{src})$ and $P_t(W X_{tar})$. The empirical estimate of MMD is as follows:

$$\left\| \frac{1}{n} \sum_{i=1}^{n} W x_i^s - \frac{1}{m} \sum_{i=1}^{m} W x_i^t \right\|^2 = \mathrm{tr}(X L X^T A), \qquad (2)$$

where $X = \{x_1^s, \ldots, x_n^s, x_1^t, \ldots, x_m^t\} \in \mathbb{R}^{d \times (n+m)}$ and $L \in \mathbb{R}^{(n+m) \times (n+m)}$ with:

$$L(i,j) = \begin{cases} \frac{1}{n^2} & \text{if } x_i, x_j \in X_{src} \\ \frac{1}{m^2} & \text{if } x_i, x_j \in X_{tar} \\ -\frac{1}{nm} & \text{otherwise.} \end{cases} \qquad (3)$$

By minimizing Equation (2), $P_s(W X_{src})$ and $P_t(W X_{tar})$ are drawn close to each other.

Transferring Discriminating Power Based on Information Theory

The metric $W$ learnt by only minimizing the distribution difference may merge all data points together, which is unsuitable for the classification task. To improve classification accuracy, as stated in Problem 1, $W$ should combine minimizing the distribution difference with 1) preserving the geometry of $X_{tar}$, and 2) maximally aligning the geometry of $X_{src}$ with its label information. Based on this combination, it is supposed that $P_s(Y_s \mid W X_{src}) \approx P_t(Y_t \mid W X_{tar})$. $W$ then optimally transfers the discriminating power gained from the source domain to the target domain, that is, points with the same label are kept close together and differently labeled points are pushed far apart. In this way, if a classifier is trained on $W X_{src}$ and $Y_s$, it can be reused to correctly classify $W X_{tar}$. Note that the combination can perform well because $X_{tar}$ and $X_{src}$ share some latent variables.

Geometry Preservation of $X_{tar}$

Preserving the geometry of unlabeled $X_{tar}$ is particularly useful for transfer learning (Long et al. 2012; Wang and Mahadevan 2011; Pan et al. 2011).
We construct a linear kernel $K_{tar}$ for $W X_{tar}$:

$$K_{tar} = (W X_{tar})^T (W X_{tar}) = X_{tar}^T A X_{tar}. \qquad (4)$$

To introduce information theory into the space of positive definite matrices, $K_{tar}$ is regarded as the covariance matrix of a multivariate Gaussian distribution with zero mean (Wang and Jin 2009):

$$Pr(z \mid K_{tar}) = \frac{1}{(2\pi)^{m/2} |K_{tar}|^{1/2}} \exp\left(-z^T K_{tar}^{-1} z / 2\right), \qquad (5)$$

where $z \in \mathbb{R}^m$. In the ideal case, an ideal kernel matrix $K_T$ is expected to give a useful similarity such that the geometry of $X_{tar}$ is preserved. $K_T$ is regarded as the covariance matrix of another multivariate Gaussian distribution:

$$Pr(z \mid K_T) = \frac{1}{(2\pi)^{m/2} |K_T|^{1/2}} \exp\left(-z^T K_T^{-1} z / 2\right), \qquad (6)$$

where $z \in \mathbb{R}^m$. The distance between $K_{tar}$ and $K_T$, denoted as $d(K_{tar} \| K_T)$, can be derived from the K-L divergence between the two distributions in Equations (5) and (6):

$$d(K_{tar} \| K_T) = KL\big(Pr(z \mid K_{tar}) \,\|\, Pr(z \mid K_T)\big) = \int Pr(z \mid K_{tar}) \log \frac{Pr(z \mid K_{tar})}{Pr(z \mid K_T)} \, dz. \qquad (7)$$

Theorem 1. The distance between $K_{tar}$ and $K_T$ in Equation (7) is equivalent to:

$$d(K_{tar} \| K_T) = \frac{1}{2}\left(\mathrm{tr}(K_T^{-1} K_{tar}) - \log|K_{tar}| + \log|K_T| - m\right). \qquad (8)$$

To capture the information of $K_T$, the optimal $A$ is searched by minimizing the distance $d(K_{tar} \| K_T)$ in Equation (8). Therefore, the geometry of unlabeled $X_{tar}$ can be preserved in the learnt distance $A$:

$$A = \arg\min_{A \succeq 0} d(K_{tar} \| K_T) = \arg\min_{A \succeq 0} \; \mathrm{tr}(K_T^{-1} X_{tar}^T A X_{tar}) - \log|X_{tar}^T A X_{tar}|. \qquad (9)$$
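The closed form in Theorem 1 is the standard K-L divergence between two zero-mean Gaussians, and it can be evaluated directly. The sketch below is illustrative only: the function name `kl_zero_mean_gaussians` and the random positive definite test matrices are assumptions, not part of the paper.

```python
import numpy as np

def kl_zero_mean_gaussians(K_a, K_b):
    """KL( N(0, K_a) || N(0, K_b) ); Equation (8) with K_a = K_tar and K_b = K_T."""
    m = K_a.shape[0]
    _, logdet_a = np.linalg.slogdet(K_a)
    _, logdet_b = np.linalg.slogdet(K_b)
    trace_term = np.trace(np.linalg.solve(K_b, K_a))   # tr(K_b^{-1} K_a)
    return 0.5 * (trace_term - logdet_a + logdet_b - m)

rng = np.random.default_rng(0)

# Two hypothetical positive definite "kernel" matrices of size m = 4.
B = rng.normal(size=(4, 4))
C = rng.normal(size=(4, 4))
K_T = B @ B.T + 4.0 * np.eye(4)
K_tar = C @ C.T + 4.0 * np.eye(4)

kl_same = kl_zero_mean_gaussians(K_T, K_T)     # identical distributions
kl_diff = kl_zero_mean_gaussians(K_tar, K_T)   # different distributions

# One-dimensional sanity check: KL(N(0,a) || N(0,b)) = 0.5 * (a/b - 1 - log(a/b)).
a, b = 2.0, 3.0
kl_1d = kl_zero_mean_gaussians(np.array([[a]]), np.array([[b]]))
```

The divergence vanishes when the two kernels coincide and is positive otherwise, which is the property Equation (9) relies on when it pulls $K_{tar}$ towards $K_T$.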

The remaining issue is to define the ideal kernel $K_T$ for geometry preservation.

1. Constructing a k-nearest neighbor graph: let $G^t$ denote a directed graph containing a set of nodes $V^t$ numbered 1 to $m$ and a set of edges $E^t$. Two nodes $i$ and $j$ are connected by an edge (i.e., $(i, j) \in E^t$) if $x_i^t$ is one of the $k$ nearest neighbors of $x_j^t$.

2. Choosing weights: let $M^t$ refer to the adjacency matrix of $G^t$, given by:

$$M^t(i,j) = \begin{cases} \exp\left(-\frac{d_{ij}}{2\sigma^2}\right) & \text{if } (i, j) \in E^t \\ 0 & \text{otherwise,} \end{cases} \qquad (10)$$

where $d_{ij} = \|x_i^t - x_j^t\|^2$ and $\sigma$ is the width.

3. Defining a kernel function $K_T$ on $G^t$: specific kernel functions (Kondor and Lafferty 2002; Smola and Kondor 2003) on $G^t$ induced by the weights can give a useful and more global sense of similarity between instances. Let $D^t$ be an $m \times m$ diagonal matrix with $D^t_{ii} = \sum_j M^t_{ij}$. The Laplacian of $G^t$ is $L^t = D^t - M^t$, and the Normalized Laplacian is $\tilde{L}^t = (D^t)^{-1/2} L^t (D^t)^{-1/2}$. The eigenvalues and eigenvectors of $\tilde{L}^t$ are denoted as $\lambda_i^t$ and $\phi_i^t$, i.e., $\tilde{L}^t = \sum_i \lambda_i^t \phi_i^t (\phi_i^t)^T$. In this paper, we investigate the diffusion kernel (Kondor and Lafferty 2002), which is proven to be a generalization of the Gaussian kernel to graphs:

$$K_T = \sum_{i=1}^{m} \exp\left(-\frac{\sigma_d^2}{2} \lambda_i^t\right) \phi_i^t (\phi_i^t)^T, \qquad (11)$$

where $K_T \succ 0$ since all the eigenvalues are positive (i.e., $\exp(-\sigma_d^2/2 \cdot \lambda_i^t) > 0$).

Label Information Utilization of $X_{src}$

A linear kernel $K_{src}$ is constructed for $W X_{src}$: $K_{src} = (W X_{src})^T (W X_{src}) = X_{src}^T A X_{src}$. Label information is critical for classification tasks and encourages similarity between two points if and only if they belong to the same class. Geometry preservation is an important component for generalization ability (Weinberger and Saul 2009; Yang et al. 2006). By incorporating these two sources of information, an ideal kernel $K_S$ is defined for $X_{src}$ based on two idealizations: 1) similarities between points with different labels will be penalized; 2) similarities between points in the same class will be encouraged according to the neighborhood structure.
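The three-step construction of the ideal target kernel $K_T$ (k-nearest neighbor graph, Gaussian weights, diffusion kernel on the normalized Laplacian) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `diffusion_kernel` is hypothetical, points are stored as columns of a `(d, m)` array, and the directed graph is symmetrized so that the Laplacian has an orthonormal eigenbasis, a step the text does not spell out.

```python
import numpy as np

def diffusion_kernel(X_tar, k=3, sigma=1.0, sigma_d=1.0):
    """Sketch of Equations (10)-(11). X_tar has shape (d, m), one point per column."""
    pts = X_tar.T                                             # (m, d)
    m = pts.shape[0]
    # d_ij = squared Euclidean distance between x_i and x_j
    d_sq = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)

    # Steps 1-2: edge (i, j) if x_i is among the k nearest neighbors of x_j,
    # weighted by exp(-d_ij / (2 sigma^2)).
    M = np.zeros((m, m))
    for j in range(m):
        order = np.argsort(d_sq[:, j])
        nbrs = order[order != j][:k]
        M[nbrs, j] = np.exp(-d_sq[nbrs, j] / (2.0 * sigma ** 2))

    # Assumption (not in the paper): symmetrize the directed graph.
    M = np.maximum(M, M.T)

    # Step 3: normalized Laplacian and its eigendecomposition.
    deg = M.sum(axis=1)
    lap = np.diag(deg) - M
    inv_sqrt = 1.0 / np.sqrt(deg)
    lap_norm = inv_sqrt[:, None] * lap * inv_sqrt[None, :]
    lam, phi = np.linalg.eigh(lap_norm)

    # K_T = sum_i exp(-sigma_d^2 / 2 * lambda_i) phi_i phi_i^T
    return (phi * np.exp(-(sigma_d ** 2) / 2.0 * lam)) @ phi.T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3, 10))        # 10 hypothetical target points in R^3
K_T_demo = diffusion_kernel(X_demo, k=3)
kernel_eigs = np.linalg.eigvalsh(K_T_demo)
```

Because the eigenvalues of a normalized Laplacian lie in $[0, 2]$, every eigenvalue of the resulting kernel is strictly positive, consistent with the statement $K_T \succ 0$. The same routine could be reused for $K_S$ by replacing the neighbor graph with a within-class graph.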
1. Constructing a within-class graph: let $G^s$ denote a directed graph which consists of a set of nodes $V^s$ numbered 1 to $n$ and a set of edges $E^s$. Two nodes $i$ and $j$ are connected by an edge (i.e., $(i, j) \in E^s$) if $y_i^s = y_j^s$.

2. Choosing the adjacency matrix $M^s$ of $G^s$: $M^s(i,j) = \exp\left(-\frac{d_{ij}}{2\sigma^2}\right)$ if $(i, j) \in E^s$, otherwise $M^s(i,j) = 0$.

3. Defining a diffusion kernel function $K_S$ on $G^s$: $K_S = \sum_{i=1}^{n} \exp\left(-\frac{\sigma_d^2}{2} \lambda_i^s\right) \phi_i^s (\phi_i^s)^T$, where $(\lambda_i^s, \phi_i^s)$ are the eigenvalues and eigenvectors of the Normalized Laplacian.

4. Minimizing $d(K_{src} \| K_S)$: the optimal $A$ is searched by minimizing the distance $d(K_{src} \| K_S)$ derived from Equation (8). Therefore, the learnt distance $A$ maximally aligns the geometry of $X_{src}$ with its label information:

$$A = \arg\min_{A \succeq 0} \; \mathrm{tr}(K_S^{-1} X_{src}^T A X_{src}) - \log|X_{src}^T A X_{src}|. \qquad (12)$$

The Cost Function

CDML aims at searching for the optimal distance metric $A$ by minimizing Equation (2), Equation (9) and Equation (12) simultaneously. This combination effectively transfers the discriminating power gained from the labeled source domain data to the unlabeled target domain data. The overall cost function is as follows:

$$A = \arg\min_{A \succeq 0} \; \mathrm{tr}\big(X (K + \mu L) X^T A\big) - \log|X_{tar}^T A X_{tar}| - \log|X_{src}^T A X_{src}|, \qquad (13)$$

where $\mu > 0$ is a tradeoff and $K = \begin{pmatrix} K_S^{-1} & 0 \\ 0 & K_T^{-1} \end{pmatrix} \succeq 0$.

Proposition 1. The $(n+m) \times (n+m)$ matrix $L$ in Equation (2) and Equation (13) is positive semi-definite.

Proof. For any column vector $z \in \mathbb{R}^{n+m}$, we have

$$z^T L z = \begin{pmatrix} a & b \end{pmatrix} \begin{pmatrix} P & R \\ R^T & Q \end{pmatrix} \begin{pmatrix} a^T \\ b^T \end{pmatrix}, \qquad (14)$$

where $a = (z_1, \ldots, z_n)$, $b = (z_{n+1}, \ldots, z_{n+m})$, $P \in \mathbb{R}^{n \times n}$ with $[P]_{ij} = 1/n^2$, $Q \in \mathbb{R}^{m \times m}$ with $[Q]_{ij} = 1/m^2$ and $R \in \mathbb{R}^{n \times m}$ with $[R]_{ij} = -1/(nm)$. $z^T L z$ in Equation (14) is equal to:

$$a P a^T + b Q b^T + 2 a R b^T = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} z_i z_j + \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} z_{n+i} z_{n+j} - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} z_i z_{n+j} = \left( \frac{z_1 + \cdots + z_n}{n} - \frac{z_{n+1} + \cdots + z_{n+m}}{m} \right)^2 \geq 0.$$

Therefore, $L \succeq 0$. The proposition follows.

Based on Proposition 1, we can obtain the closed-form solution of CDML in the following proposition.

Proposition 2. The optimal solution to Equation (13) is:

$$A = 2\big(X (K + \mu L) X^T\big)^{-1}. \qquad (15)$$

Proof. The derivative of Equation (13) w.r.t. $A$ is:

$$X (K + \mu L) X^T - 2 A^{-1}. \qquad (16)$$

Since $K \succeq 0$ and $L \succeq 0$, then $(K + \mu L) \succeq 0$. Proposition 2 now follows by setting the derivative to 0.

Low Dimensional Projections

The Mahalanobis distance metric $A$ learnt in CDML is of full rank. If $A$ has rank $r < d$, we can represent it in the form $A = W_r^T W_r$, where $W_r \in \mathbb{R}^{r \times d}$ projects the original data to an $r$-dimensional space for dimension reduction. To compute $W_r$, a straightforward solution is to optimize Equation (13) with the constraint $\mathrm{rank}(A) = r$. However, rank constraints on matrices are not convex (Boyd and Vandenberghe 2004). In this paper, the projection matrix $W_r$ is computed by a substitute approach (Globerson and Roweis 2006) as follows: 1) the eigenvalues and eigenvectors of the full-rank $A$ in Equation (15) are calculated: $A = \sum_{i=1}^{d} \lambda_i u_i u_i^T$, where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$; 2) $W_r = \mathrm{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_r}) [u_1^T; \ldots; u_r^T]$. The eigenspectrum of $A$ usually decays rapidly and many eigenvalues are very small, suggesting that this solution is close to the optimal one returned by the rank-constrained optimization.
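To make the pieces above concrete, the following NumPy sketch assembles the MMD matrix of Equation (3), the closed-form metric of Equation (15), and the eigenvalue-based projection $W_r$. It is a sketch under assumptions, not the authors' code: the function names are hypothetical, identity matrices stand in for the ideal kernels $K_S$ and $K_T$, and the small ridge term is added here only for numerical stability.

```python
import numpy as np

def mmd_matrix(n, m):
    """MMD matrix L of Equation (3): n source points followed by m target points."""
    e = np.concatenate([np.full(n, 1.0 / n), np.full(m, -1.0 / m)])
    return np.outer(e, e)      # entries 1/n^2, 1/m^2 and -1/(nm)

def cdml_closed_form(X_src, X_tar, K_S, K_T, mu=1.0, ridge=1e-8):
    """A = 2 (X (K + mu L) X^T)^{-1}, Equation (15). Columns of X_src, X_tar are points."""
    n, m = X_src.shape[1], X_tar.shape[1]
    X = np.hstack([X_src, X_tar])
    K = np.zeros((n + m, n + m))
    K[:n, :n] = np.linalg.inv(K_S)     # block-diagonal K = diag(K_S^{-1}, K_T^{-1})
    K[n:, n:] = np.linalg.inv(K_T)
    S = X @ (K + mu * mmd_matrix(n, m)) @ X.T
    S += ridge * np.eye(S.shape[0])    # assumption: ridge for stability, not in the paper
    return 2.0 * np.linalg.inv(S)

def low_rank_projection(A, r):
    """W_r = diag(sqrt(lambda_1..r)) [u_1^T; ...; u_r^T] from the top-r eigenpairs of A."""
    lam, U = np.linalg.eigh(A)
    idx = np.argsort(lam)[::-1][:r]
    return np.sqrt(np.clip(lam[idx], 0.0, None))[:, None] * U[:, idx].T

rng = np.random.default_rng(0)
d, n, m = 4, 12, 10
X_src = rng.normal(size=(d, n))
X_tar = rng.normal(loc=0.5, size=(d, m))
K_S, K_T = np.eye(n), np.eye(m)        # placeholder ideal kernels (assumption)

A = cdml_closed_form(X_src, X_tar, K_S, K_T, mu=1.0)
W_r = low_rank_projection(A, r=2)
W_full = low_rank_projection(A, r=d)

# Check Equation (2): tr(X L X^T A) equals the squared distance between projected means.
X = np.hstack([X_src, X_tar])
mmd_trace = np.trace(X @ mmd_matrix(n, m) @ X.T @ A)
W = np.linalg.cholesky(A).T            # any W with W^T W = A
mmd_direct = np.sum((W @ X_src.mean(axis=1) - W @ X_tar.mean(axis=1)) ** 2)
```

The only expensive operations are a matrix inverse and one eigendecomposition of a $d \times d$ matrix, which reflects the efficiency argument made for the closed-form solution. In practice the placeholder kernels would be replaced by diffusion kernels built from the target neighbor graph and the source within-class graph.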

Experiments

In this section, we evaluate the proposed method in two metric learning related applications: 1) face recognition and 2) text classification.

Data Preparation

Face Data Sets

FERET (Phillips et al. 2000) and YALE (Belhumeur, Hespanha, and Kriegman 1997) are two public face data sets. The FERET data set contains 13,539 face images from 1,565 individuals with different sizes, poses, illuminations and facial expressions. The YALE data set has 165 images from 15 individuals with different expressions or configurations. Some example face images are shown in Figure 1. As in the previous work (Si, Tao, and Geng 2010), we construct two cross-domain data sets: 1) Y vs F: the source domain set is YALE, and the target domain set consists of 100 individuals randomly selected from FERET. 2) F vs Y: the source set contains 100 individuals randomly selected from FERET, and the target set is YALE.

Figure 1: Image examples in (a) FERET data set and (b) YALE data set.

Text Data Sets

20-Newsgroups and Reuters-21578 are two benchmark text data sets widely used for evaluating transfer learning algorithms (Dai et al. 2007b; Li, Jin, and Long 2012; Pan et al. 2011). 20-Newsgroups consists of nearly 20,000 documents partitioned into 20 different subcategories. The corpus has four top categories and each top category has four subcategories, as shown in Table 2. Following the work of (Dai et al. 2007b), we construct six cross-domain data sets for binary text classification: comp vs rec, comp vs sci, comp vs talk, rec vs sci, rec vs talk and sci vs talk. Specifically, for each data set (e.g., comp vs rec), one top category (i.e., comp) is selected as the positive class and the other category (i.e., rec) is the negative class. Then two subcategories under the positive and the negative classes respectively are selected to form the source domain, and the other two subcategories are used to form the target domain.

Table 2: Top categories and their subcategories.
Top Category | Subcategory | Examples
comp | comp.graphics, comp.sys.mac.hardware, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware | 3870
rec | rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey | 3968
sci | sci.crypt, sci.electronics, sci.med, sci.space | 3945
talk | talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc | 3250

Reuters-21578 has three biggest top categories: orgs, people and places. The preprocessed version of Reuters-21578 published on the web is used, which contains three cross-domain data sets: orgs vs people, orgs vs place and people vs place.

Baseline Methods

We systematically compare CDML with three state-of-the-art metric learning methods, i.e., Information-Theoretic Metric Learning (ITML) (Davis et al. 2007), Information Geometry Metric Learning (IGML) (Wang and Jin 2009) and Large Margin Nearest Neighbor (LMNN) (Weinberger and Saul 2009); and three feature-based transfer learning methods, i.e., Joint Distribution Adaptation (JDA) (Long et al. 2013), Semisupervised Transfer Component Analysis (SSTCA) (Pan et al. 2011) and Transferred Fisher's Linear Discriminant Analysis (TFLDA) (Si, Tao, and Geng 2010). For the six comparison methods, the parameter spaces are empirically searched using their own optimal parameter settings and the best results are reported. CDML involves four parameters: $\sigma_d$, $\sigma$, $\mu$ and $k$. Specifically, we set $\sigma_d$ by searching the values among {0.1, 1, 10}, $\sigma$ among {0.1, 1, 10} and $\mu$ among {0.01, 0.1, 1, 10}. The neighborhood size $k$ for CDML is 3. In general, CDML is found to be robust to these parameters. The experiments are carried out on a single machine with an Intel Core processor and 10 GB of RAM running 64-bit Windows 7.

Experimental Results

Results of Face Recognition

In this section, we evaluate the ability of CDML to separate different classes in the target domain. For Y vs F and F vs Y, one random point for each target domain class is selected as the reference data set (Si, Tao, and Geng 2010). The dimensionality of each image is reduced to 100 by PCA.
All the methods are trained as a metric learning procedure without the labels of target domain data. At the testing stage, the distance between a target point and every reference point is calculated using the learnt distance metric, and the label of the testing point is then predicted as that of the nearest reference point. Since FERET and YALE have different numbers of classes, JDA is not suitable for this task because it requires that the source and target domains share the same number of classes. TFLDA can find at most $c - 1$ meaningful dimensions, where $c$ is the number of classes in the source domain. Figure 2 shows the classification error rates across different dimensions.

Figure 2: Comparison of ITML, IGML, LMNN, SSTCA, TFLDA and CDML on the face data sets. (a) Classification error rates on Y-F data set. (b) Classification error rates on F-Y data set. (c) Running time comparison.

Several observations can be made. The first general trend is that conventional metric learning algorithms (i.e., ITML, IGML and LMNN) show their limits on these cross-domain data sets. The metrics learnt only from the source domain data fail to separate different classes in the target domain. The second general trend is that SSTCA shows good classification performance. SSTCA tries to learn a kernel matrix across domains such that the label dependence is maximized and the manifold structure is preserved. However, CDML consistently provides much higher accuracy than SSTCA. A possible reason is that CDML focuses on keeping the data points in the same class close together while keeping those from different classes far apart. The third general trend is that although TFLDA works quite well, it can find at most $c - 1$ meaningful dimensions. By contrast, CDML achieves almost the optimal error rate across all the dimensions, which illustrates its effectiveness in separating different target classes.

To test the efficiency of CDML, we report the average training time in Figure 2(c). ITML, LMNN and TFLDA are computationally expensive since they formulate an alternating optimization problem. Even worse, TFLDA is non-convex and may be trapped in local solutions. Although IGML is fast due to its closed-form solution, it shows high classification error on these cross-domain data sets. We find that CDML and SSTCA run quite efficiently, while CDML outperforms SSTCA in terms of classification accuracy.

Results of Text Classification

In this section, we evaluate the ability of CDML for text classification using a simple measurement: the misclassification rate of a 1-nearest neighbor classifier (1-NN) without parameter tuning. The unlabeled target instances are compared to the points in the labeled source domain using the learnt distance metric. We compare our proposed CDML with ITML, IGML, LMNN, JDA and SSTCA for this binary task. The classification results across different dimensions are shown in Table 3.

Table 3: 1-NN classification errors (in percent) of the applied methods.
Columns: Data Set, # Dim, and Method (ITML, IGML, LMNN, JDA, SSTCA, CDML).
Rows: orgs vs people, orgs vs place, people vs place, comp vs rec, comp vs sci, comp vs talk, rec vs sci, rec vs talk, sci vs talk.

Several conclusions can be drawn from the results. First, the results of non-transfer metric learning methods are better than those of the transfer algorithms on comp vs rec and rec vs talk. A possible explanation is that on these two data sets, the distributions of source and target data do not differ significantly.
However, the transfer methods perform well on the other cross-domain data sets. Second, JDA provides better results on people vs place and sci vs talk. The possible explanation is two-fold. 1) Besides reducing the marginal distribution difference, the conditional distribution difference is also exploited in JDA. 2) The common assumption in transfer learning that reducing the difference of marginal distributions will draw the conditional distributions close is not always valid. Third, CDML achieves the minimal error rate on most of the data sets, which illustrates the reliable and effective performance of CDML for domain adaptation.

Conclusion

In this paper, we have proposed a novel metric learning algorithm to address the transfer learning problem based on information theory. It learns a shared Mahalanobis distance across domains to transfer the discriminating power gained from the source domain to the target domain. Based on the learnt distance, a standard classification model trained only in the source domain can correctly classify the target domain data. Experiments demonstrate the effectiveness of our proposed method. In future work, it is important and promising to explore an online algorithm for cross-domain metric learning, and the nonlinear version needs to be investigated.

Acknowledgments

This work is supported by Natural Science Foundation of China (630364) and Beijing Natural Science Foundation (944037).

References

Belhumeur, P. N.; Hespanha, J. P.; and Kriegman, D. J. 1997. Eigenfaces versus fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7).
Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press, Cambridge.
Caruana, R. 1997. Multitask learning. Machine Learning 28(1):41-75.
Dai, W.; Yang, Q.; Xue, G.; and Yu, Y. 2007a. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning (ICML).
Dai, W.; Xue, G.-R.; Yang, Q.; and Yu, Y. 2007b. Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
Davis, J. V.; Kulis, B.; Jain, P.; Sra, S.; and Dhillon, I. S. 2007. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning (ICML).
Fisher, R. 1936. The use of multiple measurements in taxonomic problems. Annals of Human Genetics 7(2).
Geng, B.; Tao, D.; and Xu, C. 2011. DAML: Domain adaptation metric learning. IEEE Transactions on Image Processing 20(10).
Globerson, A., and Roweis, S. 2006. Metric learning by collapsing classes. In Proceedings of the 20th Annual Conference on Advances in Neural Information Processing Systems (NIPS).
Gretton, A.; Borgwardt, K. M.; Rasch, M. J.; Scholkopf, B.; and Smola, A. J. 2006. A kernel method for the two-sample problem. In Proceedings of the 6th Annual Conference on Advances in Neural Information Processing Systems (NIPS).
Jin, R.; Wang, S.; and Zhou, Y. 2009. Regularized distance metric learning: theory and algorithm. In Proceedings of the 23rd Annual Conference on Advances in Neural Information Processing Systems (NIPS).
Jolliffe, I. 1986. Principal Component Analysis. Springer-Verlag.
Kondor, R. S., and Lafferty, J. 2002. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the 19th International Conference on Machine Learning (ICML).
Li, L.; Jin, X.; and Long, M. 2012. Topic correlation analysis for cross-domain text classification. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI).
Long, M.; Wang, J.; Ding, G.; Sun, J.; and Yu, P. S. 2013. Transfer feature learning with joint distribution adaptation. In Proceedings of the 14th IEEE International Conference on Computer Vision (ICCV).
Long, M.; Wang, J.; Ding, G.; Shen, D.; and Yang, Q. 2012. Transfer learning with graph co-regularization. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI).
Pan, S. J.; Kwok, J. T.; and Yang, Q. 2008. Transfer learning via dimensionality reduction. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI).
Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22.
Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2).
Phillips, J. P.; Moon, H.; Rizvi, S. A.; and Rauss, P. J. 2000. The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10).
Si, S.; Tao, D.; and Geng, B. 2010. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering 22(7).
Smola, A., and Kondor, R. 2003. Kernels and regularization on graphs. In Proceedings of the 16th Annual Conference on Learning Theory (COLT).
Wang, C., and Mahadevan, S. 2011. Heterogeneous domain adaptation using manifold alignment. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI).
Wang, S., and Jin, R. 2009. An information geometry approach for distance metric learning. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS).
Weinberger, K. Q.; Sha, F.; and Saul, L. K. 2004. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning (ICML).
Weinberger, K. Q., and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10.
Xing, E. P.; Ng, A. Y.; Jordan, M. I.; and Russell, S. J. 2002. Distance metric learning, with application to clustering with side-information. In Proceedings of the 16th Annual Conference on Advances in Neural Information Processing Systems (NIPS).
Yang, L.; Jin, R.; Sukthankar, R.; and Liu, Y. 2006. An efficient algorithm for local distance metric learning. In Proceedings of the 21st AAAI Conference on Artificial Intelligence (AAAI).