Stable Learning in Coding Space for Multi-Class Decoding and Its Extension for Multi-Class Hypothesis Transfer Learning

Bang Zhang, Yi Wang 2, Yang Wang, Fang Chen
1 National ICT Australia
2 School of Computer Science and Engineering, The University of New South Wales, Australia
{bang.zhang, yi.wang, yang.wang, fang.chen}@nicta.com.au

Abstract

Many prevalent multi-class classification approaches can be unified and generalized by the output coding framework [1, 7], which usually consists of three phases: (1) coding, (2) learning binary classifiers, and (3) decoding. Most of these approaches focus on the first two phases, and a predefined distance function is used for decoding. In this paper, however, we propose to perform learning in coding space for more adaptive decoding, thereby improving overall performance. Ramp loss is exploited for measuring multi-class decoding error. The proposed algorithm has uniform stability. It is insensitive to data noise and scales to large datasets. A generalization error bound and numerical results are given, with promising outcomes.

1. Introduction

Multi-class classification is a fundamental machine learning task as which many real-world problems can be cast. For instance, categorization of Internet resources, such as documents, images and videos, plays a critical role in online information retrieval. Robot vision, self-driving cars and augmented reality applications require robust categorization modules for distinguishing different visual categories. Despite its popularity in real-world applications, multi-class classification has received relatively less attention compared with its binary counterpart. Therefore, many state-of-the-art multi-class classification approaches take advantage of the developments for the binary problem by reducing a multi-class problem to a collection of binary problems. For instance, the one-versus-one and one-versus-all [15] strategies are widely used in practice with good performance.
Output coding [1, 7] is a general framework that unifies and generalizes this type of approach. It usually decomposes the learning process into three phases: (1) coding matrix design, which converts a multi-class classification problem into a set of binary problems; (2) training binary classifiers, which tackles the converted binary problems; and (3) decoding, which makes multi-class classification decisions based on binary classification outputs with the aid of a decoding scheme. Many efforts have been spent on either how to achieve optimal coding [7, 17] or how to optimize coding and binary learning simultaneously [13, 23, 22]. Very few efforts have been spent on the decoding phase. Pre-defined similarity or distance functions are usually applied directly to multi-class decoding. In this work, we propose a learning-based adaptive decoding method. There are several practical advantages to conducting problem-dependent learning in coding space. Firstly, learning-based adaptive decoding helps to refine coding for better generalization performance. Most existing coding schemes are pre-defined and only generate binary or ternary codes. This constrains the coding space in which the subsequent binary learning works. Hence, the learnability is restrained. The adaptive decoding presented in this work learns continuous codes which refine coding based on binary learning outcomes. Secondly, one of the main drawbacks of pre-defined decoding schemes is that binary classifiers are trained on different tasks separately. There is no guarantee that their real-valued outputs are properly scaled for pre-defined decoding. Moreover, nonlinear relationships between binary classifiers are usually neglected [24], which can be crucial for better generalization performance. As elaborated later, the proposed method can tackle these issues via learning and a kernelization technique.

(Footnote: NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.)
Thirdly, learning in coding space in turn helps to improve the binary classifiers. This is beneficial for tackling some difficult machine learning problems. As an extension, we show that our learning-based decoding method benefits multi-class hypothesis transfer learning (HTL) [10, 11, 19]. HTL
is a transfer learning framework that only utilizes source hypotheses trained on a source domain for enhancing target domain learning. It assumes that neither source domain examples nor the relationship between the target domain and the source domain is accessible. Such settings mimic real-world situations in which legacy binary classifiers exist but the datasets on which they were trained have become unavailable. By performing learning in coding space, auxiliary source domain hypotheses can be readily incorporated into the learning process. Knowledge can be transferred from multiple source domains to multiple target domains by alternately refining the coding and the target domain classifiers. Unlike previous HTL approaches, the proposed method considers the relationship between domains by adopting the output coding framework. This helps to discover related target domains and source hypotheses. It also prevents negative transfer from unrelated source hypotheses. In addition to the aforementioned virtues, the proposed approach extends ramp loss [5, 4] to measure multi-class decoding error. Unlike hinge loss, which is prone to outliers, ramp loss provides a clipped error measurement capping the linear penalty increment, the source of the outlier sensitivity. Besides, it helps to reduce the number of support vectors, which is critical for scalability. Uniform stability and a tight generalization bound hold for the proposed algorithm. In order to optimize ramp loss, a non-convex loss, the ConCave-Convex Procedure (CCCP) [21], which is closely related to Difference of Convex (DC) programming [18], is utilized. The rest of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 describes the proposed method and gives its generalization error bound. The multi-class HTL extension is given in Section 4. Experiments are conducted in Section 5. Finally, the work is concluded in Section 6.

2.
Related Work

The output coding framework [1, 7] converts a multi-class classification problem into a set of binary problems which can be readily solved by existing binary classification approaches. The quality of the coding matrix plays a critical role in the performance of the final solution. In early works [1], the error-correcting ability is the main measurement of the quality of the coding matrix. In other words, the codewords for different classes need to be as distinguishable as possible for good performance. But later, it was realized that such a coding strategy does not lead to superior performance compared with simple strategies, such as random-half, one-versus-one and one-versus-all. This is because the most distinguishable coding matrix may not be learnable for binary classification algorithms [13, 22]. The optimal coding should be both distinguishable and learnable [25]. In order to find a balance between them, approaches have been developed for considering binary learning and coding simultaneously [13, 23, 22]. However, all of these approaches work on binary coding or ternary coding [1]. Decoding is achieved by simply applying a pre-defined similarity function. The work in [8] provides a good summary of decoding methods. As mentioned before, the coding space is dramatically constrained by current approaches. So in this paper, we propose to perform learning in coding space with continuous codes. Several methods [10, 11, 19] have been developed for HTL. Specifically, [19] developed a multi-model knowledge transfer approach (Multi-KT) based on least squares SVM. It can properly select and weight prior knowledge coming from different source domains. [10] proposed a multiple kernel transfer learning algorithm (MKTL). By adopting the multiple kernel learning framework, it takes advantage of priors built over different features and with different learning methods. [11] developed a MULticlass Transfer Incremental LEarning (MULTIpLE) method. It also adopted least squares SVM.
It seeks a balance between transferring to the new class and preserving what has already been learned on the source models.

3. Stable Learning for Multi-Class Decoding

We review the output coding framework first. Let $S = \{z_i\}_{i=1}^{m}$ denote a set of training examples, where $z_i = (x_i, y_i)$, $x_i \in X$ is a feature vector and $y_i \in Y$ is a class label. We usually have $X = R^d$ and $Y = \{1, \dots, k\}$. The learning task is to learn a classifier $G(x) : X \to Y$ which can accurately assign a class label to an unseen example. In the output coding framework, the classifier $G$ is obtained from an ensemble of binary classifiers $H(x) = [h_1(x), \dots, h_l(x)]$ with the aid of a coding matrix $M$ which has $k$ rows and $l$ columns. Each row of the coding matrix, $M_y$, represents a codeword for the corresponding class $y$. Each column of the coding matrix represents a binary partition over all the classes. It defines a binary classification problem over the training examples, for which a binary classifier of the ensemble is trained. The length of the codes equals the number of binary classifiers and is denoted by $l$. The decoding process is performed via a similarity metric function $f(H(x), M_{y'}) : R^l \times R^l \to R$. Then $G$ can be represented as: $y^* = \arg\max_{y'} f(H(x), M_{y'})$, where $y' \in \{1, \dots, k\}$ denotes a class label. The class whose codeword is the closest to the ensemble output is predicted as the label of the input. The traditional decoding process directly applies a predefined similarity or distance measure to the binary classifiers' outputs and the codewords. For example, the Hamming distance is widely adopted for multi-class decoding: $\sum_{j=1}^{l} \frac{1 - M_{y',j} h_j(x)}{2}$, where $M_{y',j}$ represents the element in the $y'$-th row and $j$-th column of the coding matrix $M$.
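The Hamming decoding rule is straightforward to implement. The following minimal sketch (the one-versus-all coding matrix and the ensemble outputs are made-up illustrations, not data from the paper) shows the rule as written above:

```python
import numpy as np

def hamming_decode(H_x, M):
    """Predict the class whose codeword is closest in Hamming distance
    to the ensemble output H(x).

    H_x : (l,) vector of binary classifier outputs in {-1, +1}
    M   : (k, l) coding matrix with entries in {-1, +1}
    """
    # sum_j (1 - M[y', j] * h_j(x)) / 2, evaluated for every class y'
    distances = (1 - M * H_x).sum(axis=1) / 2
    return int(np.argmin(distances))

# Assumed one-versus-all coding matrix for k = 3 classes (so l = 3).
M = np.array([[+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]])
H_x = np.array([-1, +1, -1])   # the ensemble votes for class 1
print(hamming_decode(H_x, M))  # -> 1
```

Here the codeword of class 1 matches the ensemble output exactly (distance 0), so class 1 is predicted.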
Alternatively, we suggest that a problem-dependent decoding metric function can be learned from examples. For instance, a general inner product function can be used:

$$f(H, M_{y'}) = H^T W_{y'} M_{y'} = \sum_{j=1}^{l} w_j h_j(x) M_{y',j}, \quad (1)$$

where $W_{y'}$ is a diagonal weighting matrix with $w_1, \dots, w_l$ as its elements. $W_{y'}$ is the function parameter that needs to be learned from examples. It is noted that, with each class $y'$ having a weighting matrix $W_{y'}$, the original coding matrix $M$ does not need to be known for learning or applying such a decoding function. The learning for decoding can be regarded as refining the coding matrix by learning an adaptive decoding matrix for the obtained binary classifier ensemble, without necessarily knowing the coding matrix. Thus, the decoding problem becomes learning an adaptive decoding matrix $W \in R^{k \times l}$ for the decoding metric function:

$$f(H, W_{y'}) = H^T W_{y'}, \quad (2)$$

where $W_{y'}$ now represents the $y'$-th row of $W$. It can be shown that such a metric function is equivalent to the Hamming distance when the elements of $H$ and $W$ are in $\{-1, +1\}$. Other options for designing the decoding metric function certainly exist and are worth further study, e.g., Mahalanobis distance and manifold distance. However, an obvious and crucial advantage of using the inner product is that it allows us to readily make use of the kernel trick. By doing so, the lack of consideration of the nonlinear relationships between binary classifiers in the traditional output coding framework can be easily tackled. It provides the possibility of designing proper kernels for modeling the correlations of binary classifiers. It is also worth noting that, unlike traditional binary [1] or ternary [8] coding matrices, the decoding matrix in Eq. 2 can have real-valued codes [6]. For binary coding, confusing classes, which cannot be confidently colored into either of the two classes, often appear and degrade performance. Therefore, the neutral value 0 is introduced into the partition scheme of ternary coding to ignore confusing classes.
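Once the decoding matrix has been learned, applying the metric of Eq. 2 is just a matrix–vector product followed by an arg max. A minimal sketch (the real-valued decoding matrix below is an illustrative assumption, not a learned result from the paper):

```python
import numpy as np

def learned_decode(H_x, W):
    """Decoding metric f(H, W_y') = H^T W_y', evaluated for every row
    of the decoding matrix W (k x l); predict the arg max over classes."""
    scores = W @ H_x          # (k,) vector of inner products
    return int(np.argmax(scores))

# A real-valued decoding matrix: unlike binary/ternary codes, each class
# can weight each binary classifier continuously.
W = np.array([[ 0.9, -0.2, -0.4],
              [-0.3,  1.1, -0.5],
              [-0.2, -0.1,  0.8]])
print(learned_decode(np.array([-1., 1., -1.]), W))  # -> 1
```

With $W$ restricted to $\{-1, +1\}$, maximizing $H^T W_{y'}$ reduces to minimizing the Hamming distance, which is the equivalence stated in the text.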
Continuous coding goes one step further. It provides the flexibility to finely partition classes with weights. Besides, it helps to scale the binary classifiers properly for decoding. For instance, despite its advantages, the outputs of binary classifiers learned via the one-versus-all strategy are not guaranteed to be properly scaled.

3.1. The Proposed Method

We now describe the proposed decoding algorithm. Given an ensemble of binary classifiers $H(x) = [h_1(x), \dots, h_l(x)]$ and a set of training examples $S = \{(x_i, y_i)\}_{i=1}^{m}$, the goal is to design an algorithm $A$ which can learn an effective decoding metric function $f(x, y') : X \times Y \to R$ from $S$:

$$f(x, y') = H(x)^T W_{y'} = \sum_{j=1}^{l} W_{y',j} h_j(x). \quad (3)$$

(Footnote: We keep using the same symbols to denote training examples. But it is worth noting that the dataset used for learning a decoding metric function can differ from the dataset used for training the binary classifiers.)

The decoding matrix $W \in R^{k \times l}$ is the function parameter that needs to be learned. The final decision is made via $G(x)$: $y^* = \arg\max_{y'} f(x, y')$. The margin of an example $(x, y)$ for class $y'$ with respect to $f$ can be defined as:

$$\rho_f(x, y') = f(x, y) - f(x, y'), \quad (4)$$

where $y$ is the true label of $x$. It is noted that $\rho_f(x, y) = 0$. We use $\epsilon(A_S, z)$ to represent the loss of $A_S$, which denotes the solution of $A$ on $S$, with respect to an example $z = (x, y)$ in the domain $Z = X \times Y$. Such loss is usually measured by a loss function $l(f, z) : F \times Z \to R^+$. It represents the loss of a decoding metric function $f$, which is generated by $A$ on $S$, with respect to an example $z$. $F$ is generally a linear space. As an estimate of the generalization error, the empirical error of algorithm $A$ on $S$ is usually measured by the 0-1 loss:

$$R_{\mathrm{emp}}(A, S) = \frac{1}{m}\sum_{i=1}^{m} \epsilon(A_S, z_i) = \frac{1}{m}\sum_{i=1}^{m} l_{0\text{-}1}(f, z_i) = \frac{1}{m}\sum_{i=1}^{m} \delta(G(x_i) \ne y_i), \quad (5)$$

where $\delta(\pi)$ equals 1 when the predicate $\pi$ is true and 0 otherwise. From the study of classification and regression problems, we know that minimizing the 0-1 loss is computationally expensive due to its combinatorial nature.
Thus, with the consideration of a trade-off between the statistical and numerical properties of the learning loss, a surrogate loss is usually adopted as a substitute for the 0-1 loss. A popular surrogate loss is the hinge loss $l_{\mathrm{hinge}}(\rho) = \max(0, 1 - \rho)$, where $\rho$ denotes the margin. Although hinge loss has a nice numerical property, convexity, which leads to efficient optimization, its statistical properties have flaws. Its loss penalty is linearly proportional to the margin. This makes hinge loss prone to outliers. Moreover, hinge loss treats any training examples within the margin or misclassified as support vectors (SVs). This usually leads to a large number of SVs, which makes both training and testing expensive. It becomes worse when large-scale datasets are involved, since the number of SVs increases linearly with the number of training examples. In order to avert the drawbacks of hinge loss, we propose to extend ramp loss [5] as a surrogate for measuring
empirical multi-class decoding error:

$$l_{\mathrm{ramp}}^{(s_l, s_r)}(f, z) = \min\Big(s_r - s_l,\ \max_{y'}\big(s_r b_{y,y'} - \rho_f(x, y')\big)\Big) = l_{\mathrm{hinge}}^{s_r}(f, z) - l_{\mathrm{hinge}}^{s_l}(f, z) = \max_{y'}\big(s_r b_{y,y'} - \rho_f(x, y')\big) - \max_{y'}\big(s_l b_{y,y'} - \rho_f(x, y')\big), \quad (6)$$

where $b_{y,y'}$ equals 0 if $y' = y$ and 1 otherwise; similarly, $s_l b_{y,y'}$ equals 0 if $y' = y$ and $s_l$ otherwise. $s_l$ and $s_r$ represent the left and right hinge points of the ramp loss, respectively. Its piecewise form can be written as:

$$l_{\mathrm{ramp}}^{(s_l, s_r)}(f, z) = \begin{cases} B & \text{if } \rho_f(x, y') \le s_l \\ s_r b_{y,y'} - \rho_f(x, y') & \text{if } s_l \le \rho_f(x, y') \le s_r \\ 0 & \text{if } \rho_f(x, y') \ge s_r \end{cases} \quad (7)$$

where $B = s_r - s_l$ is the maximum output value of $l_{\mathrm{ramp}}^{(s_l, s_r)}$. With the empirical error measured by the extended ramp loss, the proposed algorithm $A$ can be defined as a regularized empirical error minimizer:

$$A_S = \arg\min_{f \in F} \frac{1}{m}\sum_{i=1}^{m} l_{\mathrm{ramp}}^{(s_l, s_r)}(f, z_i) + \frac{\lambda}{2} N(f) \quad (8)$$

$$= \arg\min_{W \in R^{k \times l}} \underbrace{\frac{\lambda}{2}\|W\|_F^2 + \frac{1}{m}\sum_{i=1}^{m} l_{\mathrm{hinge}}^{s_r}(f, z_i)}_{J^S_{\mathrm{convex}}(W)}\ \underbrace{-\ \frac{1}{m}\sum_{i=1}^{m} l_{\mathrm{hinge}}^{s_l}(f, z_i)}_{J^S_{\mathrm{concave}}(W)}, \quad (9)$$

where $N(f)$ is a regularizer that measures the complexity of $f$ and $\|\cdot\|_F$ represents the Frobenius norm. It is noted that the learning objective can be decomposed into the sum of a convex component $J^S_{\mathrm{convex}}(W)$ and a concave component $J^S_{\mathrm{concave}}(W)$. Thus, the optimization problem defined by Eq. 9 can be solved by an iterative learning procedure, the Concave-Convex Procedure (CCCP) [21], which is closely related to Difference of Convex (DC) programming [18]. At each learning iteration, a convex upper bound of the learning objective is found by replacing the concave component with its first-order Taylor expansion at the current state. The learning problem then reduces to a convex problem on which existing kernelization and optimization techniques can be applied. The learning procedure stops when convergence is reached.

3.2. Generalization Error Bound

Following the work in [3], we utilize a concentration inequality (specifically, the McDiarmid inequality [14]) and algorithmic stability [3] to give a generalization error bound for the proposed decoding method. We keep using $A_S$ to denote the decoding metric that is learned by the proposed algorithm $A$ from the dataset $S$.
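The difference-of-hinges form of Eq. 6 can be sketched directly in code. Note that the max over $y'$ includes the true class, whose term is 0, so each hinge term is implicitly non-negative (a sketch under the reconstruction of Eqs. 6–7 above; the margin values passed in are made up for illustration):

```python
def multiclass_ramp_loss(rho, true_y, s_l=-1.0, s_r=1.0):
    """Multi-class ramp loss l_ramp^{(s_l, s_r)} as the difference of two
    hinge terms (Eq. 6).

    rho    : list with rho[y'] = f(x, y) - f(x, y'), so rho[true_y] == 0
    true_y : index of the true class; b[y'] is 0 there and 1 elsewhere
    """
    k = len(rho)
    b = [0.0 if yp == true_y else 1.0 for yp in range(k)]
    hinge_r = max(s_r * b[yp] - rho[yp] for yp in range(k))  # l_hinge^{s_r}
    hinge_l = max(s_l * b[yp] - rho[yp] for yp in range(k))  # l_hinge^{s_l}
    return hinge_r - hinge_l

# A confidently correct example (all margins beyond s_r) incurs no loss;
# a badly misclassified one is clipped at B = s_r - s_l.
print(multiclass_ramp_loss([0.0, 2.0, 3.0], true_y=0))    # -> 0.0
print(multiclass_ramp_loss([0.0, -5.0, -4.0], true_y=0))  # -> 2.0
```

The clipping at $B = s_r - s_l$ in the second call is exactly the property that caps the penalty on outliers, in contrast to the unbounded hinge loss.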
The generalization error is a random variable depending on $S$. It can be written as:

$$R(A, S) = E_z[\epsilon(A_S, z)], \quad (10)$$

where $\epsilon(A_S, z)$ denotes the loss of a decoding metric $f$ with respect to an example $z = (x, y)$. However, the expectation $E_z$ cannot be calculated because the underlying distribution of $z$ is unknown. Thus, the empirical error on the available dataset $S$ is usually utilized as an estimator of the generalization error:

$$R_{\mathrm{emp}}(A, S) = \frac{1}{m}\sum_{i=1}^{m} \epsilon(A_S, z_i). \quad (11)$$

For the proposed method, the multi-class ramp loss $l_{\mathrm{ramp}}^{(s_l, s_r)}$, as defined in Eq. 7, is used to measure $\epsilon(A_S, z)$. The goal of our analysis here is to give a bound for the difference between $R(A, S)$ and $R_{\mathrm{emp}}(A, S)$, more precisely, a bound for $P_S[|R_{\mathrm{emp}}(A, S) - R(A, S)| > \eta]$, where $\eta > 0$. In the following, we briefly review the analysis in [3] for the generalization error bound given uniform stability. Then uniform stability is shown to hold for the proposed decoding algorithm.

Theorem 1 (McDiarmid Inequality [14]) Given a set of examples $S = \{z_i = (x_i, y_i)\}_{i=1}^{m}$, a modified set $S^i$ defined as $S^i = \{z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_m\}$, and a measurable function $F : Z^m \to R$, if the following condition is satisfied: $\sup_{S \in Z^m, z_i' \in Z} |F(S) - F(S^i)| \le c_i$, then $P_S[F(S) - E_S[F(S)] \ge \eta] \le \exp\!\big(-2\eta^2 / \sum_{i=1}^{m} c_i^2\big)$.

Definition 1 (Uniform Stability [3]) An algorithm $A$ has uniform stability $\beta$ if the following condition holds: $\forall S \in Z^m$, $\forall i \in \{1, \dots, m\}$, $\sup_z |\epsilon(A_S, z) - \epsilon(A_{S^{\setminus i}}, z)| \le \beta$. $S^{\setminus i}$ denotes the dataset which excludes the $i$-th element from $S$.

With Theorem 1 and the uniform stability defined as above, the following theorem about the generalization error bound was given in [3].

Theorem 2 (Exponential Bounds with Uniform Stability [3]) Given an algorithm $A$ having uniform stability $\beta$ with respect to a loss function $l$ which satisfies $0 \le l(A_S, z) \le B$ for all sets $S$, then for any $m \ge 1$ and any $\delta \in (0, 1)$, the following bound holds with probability at least $1 - \delta$ over the random draw of the sample $S$:

$$R \le R_{\mathrm{emp}} + 2\beta + (4m\beta + B)\sqrt{\frac{\ln(1/\delta)}{2m}}.$$
Now we derive the following theorem about the uniform stability of ramp loss.

Theorem 3 Given (1) a reproducing kernel Hilbert space $E$ with kernel $K : X \times X \to R$ satisfying $\forall x \in X$, $K(x, x) \le \kappa^2 < \infty$, (2) a loss function $l(f, z) : F \times Z \to [0, \infty)$ which is monotonic and Lipschitz in its first argument, and (3) a symmetric algorithm $A$ generating a solution $f_S \in F$ on a training set $S$ by optimizing $f_S = \arg\min_{f \in F} R^{l}_{\mathrm{emp}}(A, S) + \frac{\lambda}{2}\|f\|_K^2$, where $R^{l}_{\mathrm{emp}} = \frac{1}{m}\sum_{i=1}^{m} l(f, z_i)$ and $\lambda > 0$, then $A$ has uniform stability $\beta(m) = \frac{8\kappa^2}{s_r^2 \lambda m}$.

The proof of Theorem 3 is deferred to the supplementary material. The direct application of Theorem 2 and Theorem 3 provides the following generalization error bound for the proposed decoding algorithm.

Example 1 Given the conditions of Theorem 3, the following generalization error bound holds for algorithm $A$:

$$R(A, S) \le R^{\mathrm{ramp}}_{\mathrm{emp}}(A, S) + \frac{16\kappa^2}{s_r^2 \lambda m} + \left(\frac{32\kappa^2}{s_r^2 \lambda} + B\right)\sqrt{\frac{\ln(1/\delta)}{2m}}.$$

Remark 1 The above gives a tight bound for the algorithm defined by Eq. 9, since $\beta$ scales as $1/m$. From this bound, we can see that a larger $s_r$ leads to a faster learning rate. Similar conclusions for structured estimation and clipped regularized risk minimizers are obtained in [4]. However, a larger $s_r$ also leads to a denser classifier which has more SVs [5]. For $B = s_r - s_l$ and $s_l < s_r$, we can see that a larger $s_l$ leads to a smaller $B$, which indicates a better bound. It is possible to set $s_l$ in $[0, s_r)$. This gives a smaller $B$ and a shorter SV interval. However, misclassified examples will then never become SVs, and this degrades the generalization performance. The later empirical study verifies this. Note that SVM with ramp loss does not have the above bound because of the bias term in its prediction function. There is no explicit $B$ for hinge loss to have the above bound.

4.
Extension for Multi-Class Hypothesis Transfer Learning

Given target domain examples $S = \{z_i\}_{i=1}^{m}$ and a collection of auxiliary source domain hypotheses $\Delta H(x) = [h_{l+1}, \dots, h_{l+n_{sh}}]$, where $n_{sh}$ denotes the number of source hypotheses, the aim of multi-class HTL is to learn a set of binary hypotheses $H(x) = [h_1(x), \dots, h_l(x)]$ in the target domains and an adaptive decoding metric function $f(x, y') = H^T W_{y'}$, where $H = [h_1(x), \dots, h_{l+n_{sh}}(x)]$, so that the classification performance for the target domains is improved by transferring the knowledge in $\Delta H(x)$. It is noted that the decoding metric function $f(x, y')$ is now parameterized by both $W$ and $H(x)$. If we assume $H(x)$ is a set of binary hypotheses of the form $h_j(x) = w_j^\top x + b_j$, where $j \in [1, l]$, then the extended multi-class HTL algorithm becomes:

$$A_{S, \Delta H} = \arg\min_{f \in F} \frac{1}{m}\sum_{i=1}^{m} l_{\mathrm{ramp}}^{(s_l, s_r)}(f, z_i) + \frac{\lambda}{2} N(f) = \arg\min_{W \in R^{k \times (l+n_{sh})},\, H} \frac{1}{m}\sum_{i=1}^{m} l_{\mathrm{ramp}}^{(s_l, s_r)}(f, z_i) + \frac{\lambda}{2}\Big(\|W\|_F^2 + \sum_{j=1}^{l} \|w_j\|^2\Big). \quad (12)$$

The globally optimal solution is difficult, if not impossible, to obtain, since both $W$ and $H(x)$ need to be optimized. So we propose to adopt an alternating optimization strategy: (i) for fixed $H(x)$, optimize $W$; (ii) for fixed $W$, optimize $H(x)$. The two steps alternate until convergence or until the maximum number of iterations is reached. Although there is no guarantee that the global optimum is achieved, each optimization step decreases the overall objective function. Step (i) can be readily solved by the proposed learning-based decoding method. For step (ii), simultaneously optimizing all the binary hypotheses is difficult. Inspired by the random coordinate descent approach, we propose to optimize the binary hypotheses cyclically in a random order until convergence or the maximum number of iterations is reached. Any existing coding scheme can be used for initializing the algorithm. Target domain classifiers are first learned based on the initial coding. Learning-based decoding is then conducted with the support of the auxiliary source hypotheses. Subsequently, the target domain classifiers are updated.
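The alternating scheme above can be sketched at a high level as follows. This is a structural sketch only: `fit_decoding`, `refit_hypothesis`, and `objective` stand in for the CCCP-based solvers described in Section 3 and are assumed callables, not part of the paper's method itself:

```python
import random

def multiclass_htl(S, source_hyps, init_hypotheses,
                   fit_decoding, refit_hypothesis, objective,
                   max_iters=50, tol=1e-4):
    """Alternating optimization for multi-class HTL (Eq. 12, sketch).

    fit_decoding(S, all_hyps)       -> W     step (i):  fix H, learn W
    refit_hypothesis(S, W, j, H)    -> h_j   step (ii): fix W, update h_j
    objective(S, W, H)              -> float monitored for convergence
    """
    H = list(init_hypotheses)        # target domain hypotheses (initial coding)
    W, prev = None, float("inf")
    for _ in range(max_iters):
        # Step (i): learned decoding over target and source hypotheses jointly.
        W = fit_decoding(S, H + list(source_hyps))
        # Step (ii): refit target hypotheses cyclically in a random order.
        order = list(range(len(H)))
        random.shuffle(order)
        for j in order:
            H[j] = refit_hypothesis(S, W, j, H)
        obj = objective(S, W, H)
        if prev - obj < tol:         # each step decreases the objective,
            break                    # so this loop terminates
        prev = obj
    return W, H
```

Because both alternating steps are descent steps on the same regularized ramp-loss objective, monitoring the objective value is sufficient as a stopping criterion.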
The alternating optimization continues until convergence or until the maximum number of iterations is reached.

5. Experiments

We perform three experiments to evaluate the performance of the proposed method. The first experiment compares our learning-based decoding method with state-of-the-art decoding approaches. The second experiment evaluates the impact of ramp loss under different parameter settings, to verify our theoretical analysis. The last experiment compares our multi-class HTL method with state-of-the-art HTL algorithms. Unless otherwise mentioned, the following methodology is adopted for the experiments. The model parameters are selected via cross-validation for the best generalization performance. For the compared state-of-the-art techniques, the default and optimized parameters given by the authors are used. 10-fold cross-validation is applied to acquire the average accuracy and standard deviation for performance measurement.
Table 1. Average performances (in percentage) of the state-of-the-art and the proposed decoding methods on six multi-class datasets from UCI. The best results are shown in bold.

                 Glass        Vowel        Balance      Satimage     Letter       Vehicle
HD    SVM    55.75 ± 3.60  65.73 ± 2.62  85.57 ± 4.8   83.36 ± 2.02  85.1 ± 0.97   76.2 ± 2.40
      Boost  66.69 ± 3.6   61.30 ± 2.67  78.97 ± 5.02  83.25 ± 2.23  88.3 ± 1.58   72.34 ± 3.34
PD    SVM    57.04 ± 3.59  66.37 ± 2.69  83.36 ± 4.09  80.90 ± 1.74  77.82 ± 1.0   76.00 ± 1.45
      Boost  64.35 ± 2.72  58.48 ± 3.02  82.22 ± 4.9   83.90 ± 1.82  87.65 ± 2.0   72.58 ± 2.95
LAP   SVM    57.73 ± 3.4   68.40 ± 2.94  85.57 ± 4.8   83.36 ± 2.02  88.89 ± 1.05  76.2 ± 2.40
      Boost  66.69 ± 3.6   65.36 ± 2.7   80.5 ± 4.0    84.5 ± 2.22   90.2 ± 1.8    72.34 ± 3.34
ELW   SVM    59.30 ± 3.6   71.44 ± 3.6   85.57 ± 4.8   84.07 ± 2.00  89.44 ± 0.97  76.95 ± 1.76
      Boost  66.69 ± 3.6   71.77 ± 3.02  77.55 ± 4.46  85.37 ± 1.87  91.92 ± 1.58  73.5 ± 3.6
Prpsd SVM    63.06 ± 2.67  74.59 ± 2.39  88.76 ± 1.56  90.22 ± 1.02  95.32 ± 1.20  79.87 ± 1.49
      Boost  68.89 ± 2.7   74.33 ± 2.78  83.09 ± 2.78  91.08 ± 2.27  96.77 ± 1.26  76.29 ± 2.80

5.1. Comparison with Previous Decoding Methods

We first compare state-of-the-art decoding methods with the proposed learning-based decoding algorithm. As shown in Table 1, six multi-class datasets from the UCI Machine Learning Repository [2] are used for the comparison. The compared decoding methods include Hamming decoding (HD), probabilistic-based decoding (PD) [8], Laplacian decoding (LAP) [8], and exponential loss-weighted decoding (ELW) [8] with continuous classifier outputs. Similar to the settings in [8], SVM and AdaBoost (40 runs of decision stumps) are used as binary classifiers. An RBF kernel is used for the proposed decoding method. Seven popular coding schemes are considered, including one-versus-one, one-versus-all, dense random, sparse random, DECOC [17] and ECOC-ONE [16]. For each decoding method, only the best coding result is reported. The comparison results are shown in Table 1. We can see that the proposed decoding method outperforms the other decoding methods.
Large improvements are gained on the Letter and Satimage datasets, which have larger numbers of training examples than the others.

5.2. The Impact of Ramp Loss Settings

The second experiment evaluates the impact of ramp loss under different parameter settings. The data used for this experiment is the 15-scene image recognition dataset [12]. There are 4485 images in total. 100 images are selected from each category for training. The remaining 2985 images are used for testing. 10-fold cross-validation is applied. For visual description, we use spatial Principal component Analysis of Census Transform histograms (sPACT) [20, 9, 22] as features. The result is shown in Fig. 1.

Figure 1. Impacts of $s_l$ on learning scalability and generalization performance.

The left plot shows the number of SVs as a function of the number of training examples. It can be seen that, for hinge loss (i.e., ramp loss with the left hinge point at $-\infty$), the number of SVs increases linearly with the number of training examples. It becomes sublinear when ramp loss is used. We can observe that better scalability (fewer SVs) is achieved when $s_l$ increases. A larger $s_l$ leads to a shorter SV interval. Subsequently, the number of SVs is dramatically reduced. This indicates that ramp loss gives us the ability to control the scalability of the algorithm by controlling the length of the SV interval. The right plot in Fig. 1 shows the impact of $s_l$ on both the generalization performance and scalability. Two trends can be observed: (1) At the early stage, the number of SVs increases as $s_l$ decreases, and the generalization error is reduced correspondingly. However, the generalization error starts to increase when $s_l$ is decreased further after it reaches 0.9. (2) The number of SVs rapidly decreases when $s_l$ increases. But when $s_l$ becomes larger than 0.9, the generalization performance also starts to drop rapidly.
These trends verify our conjecture that the second hinge point $s_l$ gives us the flexibility to control the balance between generalization performance and scalability. Depending on the specific problem and its training data, an optimal $s_l$ can be obtained in terms of the optimal balance between computational cost and generalization performance gain.

5.3. Multi-class HTL for Object Recognition

The last experiment is conducted on the Caltech-256 dataset to verify the proposed multi-class HTL method. We use a similar experimental setting to the one used in [10]. We choose 10 classes (bonsai, gorilla, horse, motorbikes, mushroom, segway, skateboard, skunk, snowmobile, sunflower) as target classes. A maximum of 30 examples are selected from each class for training, and 50 examples are used for testing. 20 classes (palm-tree, cactus, fern, hibiscus, bat, bear, leopards, zebra, dolphin, killer-whale, mountain-bike, fire-truck, car-side, bulldozer, camel, dog, raccoon, chimp, school-bus and touring-bike) are used as source classes. The PHOG shape descriptor, the SIFT appearance descriptor, Region Covariance and Local Binary Patterns are used as visual features for representing examples. The one-versus-all strategy and SVM with an RBF kernel are adopted for generating the source domain hypotheses. For the proposed method, the one-versus-all strategy is used for initializing target domain training. SVM with an RBF kernel is used for training the target domain classifiers. The balancing parameter $\lambda$ and the RBF kernel parameter $\gamma$ are obtained by cross-validation. We compare our method with three state-of-the-art HTL approaches: MKTL [10], MultiKT [19] and MULTIpLE [11]. Since MultiKT is a binary transfer learning approach, the one-versus-all strategy is adopted for it to achieve multi-class HTL. Learning without knowledge transfer is also compared as a baseline. It adopts the one-versus-rest strategy and only uses target domain examples for training classifiers. The experiment runs 10 times. Training and testing examples are randomly drawn from each category for each run. Recognition accuracies are averaged over runs and classes. The results are shown in Fig. 2 as the number of training examples per class increases.

Figure 2. Recognition accuracy for Caltech-256 (10 target classes and 20 source classes) multi-class object categorization.

From Fig. 2, we can see that all the HTL approaches perform better than learning without knowledge transfer, and the proposed method consistently outperforms the other HTL approaches. The improvement becomes larger as the number of training examples per class increases. The reason is that, with more target domain training examples available, the proposed method improves not only target domain classifier learning but also coding space learning.

6. Conclusion

A learning-based decoding method is proposed for improving multi-class classification. It exploits ramp loss to measure multi-class decoding error and refines coding with real-valued codes. Both theoretical analysis and numerical results show its superiority over existing approaches. The method is further extended for multi-class HTL. Source domain hypotheses are exploited for leveraging target domain training via an alternating optimization process between coding space learning and target domain training. Experimental results show that it outperforms the state-of-the-art multi-class HTL approaches.

References

[1] E. L. Allwein et al. Reducing multiclass to binary: A unifying approach for margin classifiers. JMLR, 1:113–141, 2001.
[2] A. Asuncion and D. Newman. UCI machine learning repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. University of California, Irvine, 2007.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
[4] O. Chapelle, C. Do, Q. Le, A. Smola, and C. Teo. Tighter bounds for structured estimation. NIPS, pages 281–288, 2008.
[5] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML, pages 201–208. ACM, 2006.
[6] K. Crammer and Y. Singer. Improved output coding for classification using continuous relaxation. NIPS, pages 437–443, 2001.
[7] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. JAIR, 2:263–286, 1995.
[8] S. Escalera et al. On the decoding process in ternary error-correcting output codes. TPAMI, 32(1):120–134, 2010.
[9] G. Herman et al. Region-based image categorization with reduced feature set. In MMSP, pages 586–591, 2008.
[10] L. Jie et al. Multiclass transfer learning from unconstrained priors. In ICCV, pages 1863–1870, 2011.
[11] I. Kuzborskij, F. Orabona, and B. Caputo. From N to N+1: Multiclass transfer incremental learning. In CVPR, 2013.
[12] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169–2178, 2006.
[13] L. Li.
Multiclass boosting with repartitioning. In ICML, pages 569–576. ACM, 2006.
[14] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.
[15] N. J. Nilsson. Learning Machines. 1965.
[16] O. Pujol et al. An incremental node embedding technique for error correcting output codes. PR, 41(2):713–725, 2008.
[17] O. Pujol, P. Radeva, and J. Vitria. Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. TPAMI, 28(6):1007–1012, 2006.
[18] P. D. Tao et al. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1–4):23–46, 2005.
[19] T. Tommasi, F. Orabona, and B. Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In CVPR, pages 3081–3088, 2010.
[20] J. Wu and J. M. Rehg. Where am I: Place instance and category recognition using spatial PACT. In CVPR, pages 1–8, 2008.
[21] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
[22] B. Zhang et al. Finding shareable informative patterns and optimal coding matrix for multiclass boosting. In ICCV, pages 56–63, 2009.
[23] B. Zhang et al. Multi-class graph boosting with subgraph sharing for object recognition. In ICPR, pages 1541–1544, 2010.
[24] B. Zhang et al. Multilabel image classification via high-order label correlation driven active learning. IEEE TIP, 23(3), 2014.
[25] Y. Zhang and J. Schneider. Maximum margin output coding. ICML, 2012.