Stable Learning in Coding Space for Multi-Class Decoding and Its Extension for Multi-Class Hypothesis Transfer Learning



Stable Learning in Coding Space for Multi-Class Decoding and Its Extension for Multi-Class Hypothesis Transfer Learning

Bang Zhang 1, Yi Wang 2, Yang Wang 1, Fang Chen 2
1 National ICT Australia
2 School of Computer Science and Engineering, The University of New South Wales, Australia
{bang.zhang, yi.wang, yang.wang, fang.chen}@nicta.com.au

(NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.)

Abstract

Many prevalent multi-class classification approaches can be unified and generalized by the output coding framework [1, 7], which usually consists of three phases: (1) coding, (2) learning binary classifiers, and (3) decoding. Most of these approaches focus on the first two phases, and a predefined distance function is used for decoding. In this paper, however, we propose to perform learning in the coding space for more adaptive decoding, thereby improving overall performance. Ramp loss is exploited for measuring multi-class decoding error. The proposed algorithm has uniform stability. It is insensitive to data noise and scalable to large-scale datasets. A generalization error bound and numerical results are given, with promising outcomes.

1. Introduction

Multi-class classification is a fundamental machine learning task as which many real-world problems can be cast. For instance, categorization of Internet resources, such as documents, images and videos, plays a critical role in online information retrieval. Robot vision, self-driving cars and augmented reality applications require robust categorization modules for distinguishing different visual categories. Despite its popularity in real-world applications, multi-class classification has received relatively little attention compared with its binary counterpart. Therefore, many state-of-the-art multi-class classification approaches take advantage of the developments for the binary problem by reducing a multi-class problem to a collection of binary problems. For instance, the one-versus-one and one-versus-all [15] strategies are widely used in practice with good performance.

Output coding [1, 7] is a general framework that unifies and generalizes this type of approach. It usually decomposes the learning process into three phases: (1) coding matrix design, which converts a multi-class classification problem into a set of binary problems; (2) training binary classifiers, which tackles the converted binary problems; and (3) decoding, which makes multi-class classification decisions based on the binary classification outputs with the aid of a decoding scheme. Many efforts have been spent on either how to achieve optimal coding [7, 17] or how to optimize coding and binary learning simultaneously [13, 23, 22]. Very few efforts have been spent on the decoding phase. Predefined similarity or distance functions are usually applied directly to multi-class decoding. In this work, we propose a learning-based adaptive decoding method.

There are several practical advantages to conducting problem-dependent learning in the coding space. Firstly, learning-based adaptive decoding helps to refine the coding for better generalization performance. Most of the existing coding schemes are predefined and only generate binary or ternary codes. This constrains the coding space in which the subsequent binary learning works. Hence, learnability is restrained. The adaptive decoding presented in this work learns continuous codes which refine the coding based on the binary learning outcomes.
Secondly, one of the main drawbacks of predefined decoding schemes is that the binary classifiers are trained on different tasks separately. There is no guarantee that their real-valued outputs are properly scaled for predefined decoding. Moreover, the nonlinear relationships between binary classifiers are usually neglected [24], which can be crucial for better generalization performance. As elaborated later, the proposed method can tackle these issues via learning and the kernelization technique. Thirdly, learning in the coding space in turn helps to improve the binary classifiers. This is beneficial for tackling some difficult machine learning problems. As an extension, we show that our learning-based decoding method benefits multi-class hypothesis transfer learning (HTL) [10, 11, 19].

HTL is a transfer learning framework that only utilizes source hypotheses trained on a source domain for enhancing target domain learning. It assumes that neither source domain examples nor the relationship between the target domain and the source domain is accessible. Such settings mimic real-world situations in which legacy binary classifiers exist but the datasets on which they were trained have already become unavailable. By performing learning in the coding space, auxiliary source domain hypotheses can be readily incorporated into the learning process. Knowledge can be transferred from multiple source domains to multiple target domains by alternately refining the coding and the target domain classifiers. Unlike previous HTL approaches, the proposed method considers the relationship between domains by adopting the output coding framework. It helps to discover related target domains and source hypotheses. It also prevents negative transfer from unrelated source hypotheses.

In addition to the aforementioned virtues, the proposed approach extends ramp loss [5, 4] to measure multi-class decoding error. Unlike hinge loss, which is prone to outliers, ramp loss provides a clipped error measurement capping the linear penalty increment, the source of the outlier sensitivity. Besides, it helps to reduce the number of support vectors, which is critical to scalability. Uniform stability and a tight generalization bound hold for the proposed algorithm. In order to optimize ramp loss, a non-convex loss, the ConCave-Convex Procedure (CCCP) [21], which is closely related to Difference of Convex (DC) programming [18], is utilized.

The rest of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 describes the proposed method and gives its generalization error bound. The multi-class HTL extension is given in Section 4. Experiments are conducted in Section 5. Finally, the work is concluded in Section 6.

2. Related Work

The output coding framework [1, 7] converts a multi-class classification problem into a set of binary problems which can be readily solved by existing binary classification approaches. The quality of the coding matrix plays a critical role in the performance of the final solution. In early works [1], error-correcting ability is the main measure of the quality of the coding matrix. In other words, the codewords for different classes need to be as distinguishable as possible for good performance. Later, however, it was realized that such coding strategies do not lead to superior performance compared with simple strategies, such as random-half, one-versus-one and one-versus-all. This is because the most distinguishable coding matrix may not be learnable for binary classification algorithms [13, 22]. The optimal coding should be both distinguishable and learnable [25]. In order to find a balance between them, approaches have been developed that consider binary learning and coding simultaneously [13, 23, 22]. However, all of these approaches work on binary coding or ternary coding [1]. Decoding is achieved by simply applying a predefined similarity function. The work in [8] provides a good summary of decoding methods. As mentioned before, the coding space is dramatically constrained by current approaches. So in this paper, we propose to perform learning in the coding space with continuous codes.

Several methods [10, 11, 19] have been developed for HTL. Specifically, [19] developed a multi-model knowledge transfer approach (Multi-KT) based on least squares SVM. It can properly select and weight prior knowledge coming from different source domains.
[10] proposed a multiple kernel transfer learning algorithm (MKTL). By adopting the multiple kernel learning framework, it takes advantage of priors built over different features and with different learning methods. [11] developed a MULticlass Transfer Incremental LEarning (MULTIpLE) method. It also adopted least squares SVM. It seeks a balance between transferring to the new class and preserving what has already been learned on the source models.

3. Stable Learning for Multi-class Decoding

We review the output coding framework first. Let S = {z_i}_{i=1}^{m} denote a set of training examples, where z_i = (x_i, y_i), x_i ∈ X is a feature vector and y_i ∈ Y is a class label. We usually have X = R^d and Y = {1, ..., k}. The learning task is to learn a classifier G(x): X → Y which can accurately assign a class label to an unseen example.

In the output coding framework, the classifier G is obtained from an ensemble of binary classifiers H(x) = [h_1(x), ..., h_l(x)] with the aid of a coding matrix M which has k rows and l columns. Each row of the coding matrix, M_y, represents a codeword for the corresponding class y. Each column of the coding matrix represents a binary partition over all the classes. It defines a binary classification problem over the training examples, for which a binary classifier of the ensemble is trained. The length of the codes equals the number of binary classifiers and is denoted by l. The decoding process is performed via a similarity metric function f(H(x), M_{y'}): R^l × R^l → R. Then G can be represented as: y* = arg max_{y'} f(H(x), M_{y'}), where y' ∈ {1, ..., k} denotes a class label. The class whose codeword is closest to the ensemble output is predicted as the label of the input.

The traditional decoding process is conducted by directly applying a predefined similarity or distance measure to the binary classifiers' outputs and the codewords. For example, Hamming distance is widely adopted for multi-class decoding:

Σ_{j=1}^{l} (1 − M_{y',j} h_j(x)) / 2,

where M_{y',j} represents the element in the y'-th row and j-th column of the coding matrix M.
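To ground the framework, the short sketch below builds a one-versus-all coding matrix and performs Hamming-style decoding of an ensemble output. It is a minimal illustration of the standard pipeline described above; the function names and toy values are ours, not from the paper.

```python
import numpy as np

def one_vs_all_matrix(k):
    """One-versus-all coding matrix: row y is the codeword of class y,
    column j defines the binary problem 'class j vs. the rest'."""
    return 2 * np.eye(k) - 1                      # entries in {-1, +1}

def hamming_decode(h_out, M):
    """h_out: length-l vector of binary classifier outputs in {-1, +1}.
    M: (k, l) coding matrix. Returns the class with the closest codeword,
    using the distance sum_j (1 - M[y', j] * h_j(x)) / 2."""
    dists = np.sum((1 - M * h_out) / 2, axis=1)   # one distance per class
    return int(np.argmin(dists))

M = one_vs_all_matrix(4)                          # k = 4 classes, l = 4 columns
h_out = np.array([-1.0, 1.0, -1.0, -1.0])         # ensemble output for one example
print(hamming_decode(h_out, M))                   # prints 1
```

The learned decoding discussed next replaces the fixed matrix M and the predefined distance with a real-valued matrix W fitted to data.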

Alternatively, we suggest that a problem-dependent decoding metric function can be learned from examples. For instance, a general inner product function can be used:

f(H, M_{y'}) = H^T W_{y'} M_{y'} = Σ_{j=1}^{l} w_j h_j(x) M_{y',j},   (1)

where W_{y'} is a diagonal weighting matrix with w_1, ..., w_l as its elements. W_{y'} is the function parameter that needs to be learned from examples. It is noted that, with each class y' having a weighting matrix W_{y'}, the original coding matrix M does not need to be known for learning or applying such a decoding function. The learning for the decoding process can be regarded as refining the coding matrix by learning an adaptive decoding matrix for the obtained binary classifier ensemble, without necessarily knowing the coding matrix. Thus, the decoding problem becomes learning an adaptive decoding matrix W ∈ R^{k×l} for the decoding metric function:

f(H, W_{y'}) = H^T W_{y'},   (2)

where W_{y'} now represents the y'-th row of W. It can be shown that such a metric function is equivalent to Hamming distance when the elements of H and W are in {−1, +1}. Other options for designing the decoding metric function certainly exist and are worth further study, e.g., Mahalanobis distance and manifold distance. However, an obvious and crucial advantage of using the inner product is that it allows us to readily make use of the kernel trick. By doing so, the lack of consideration of the nonlinear relationships between binary classifiers in the traditional output coding framework can be easily tackled. It gives us the possibility to design proper kernels for modeling the correlations of the binary classifiers.

It is also worth noting that, unlike a traditional binary [1] or ternary [8] coding matrix, the decoding matrix in Eq. 2 can have real-valued codes [6]. For binary coding, confusing classes, which cannot be confidently colored into either of the two classes, often appear and degrade the performance. Therefore, the neutral value 0 is introduced into the partition scheme of ternary coding to ignore confusing classes. Continuous coding goes one step further. It provides the flexibility to finely partition classes with weights. Besides, it helps to scale the binary classifiers properly for decoding. For instance, despite its advantages, the outputs of the binary classifiers learned with the one-versus-all strategy are not guaranteed to be properly scaled.

3.1. The Proposed Method

We now describe the proposed decoding algorithm. Given an ensemble of binary classifiers H(x) = [h_1(x), ..., h_l(x)] and a set of training examples S = {(x_i, y_i)} (we keep using the same symbols to denote training examples, but it is worth noting that the data set used for learning a decoding metric function can differ from the data set used for training the binary classifiers), the goal is to design an algorithm A which can learn an effective decoding metric function f(x, y'): X × Y → R from S:

f(x, y') = H(x)^T W_{y'} = Σ_{j=1}^{l} W_{y',j} h_j(x).   (3)

The decoding matrix W ∈ R^{k×l} is the function parameter that needs to be learned. The final decision is made via G(x): y* = arg max_{y'} f(x, y'). The margin of an example (x, y) for class y' with respect to f can be defined as:

ρ_f(x, y') = f(x, y) − f(x, y'),   (4)

where y is the true label of x. It is noted that ρ_f(x, y) = 0. We use ɛ(A_S, z) to represent the loss of A_S, which denotes the solution of A on S, with respect to an example z = (x, y) in the domain Z = X × Y. Such a loss is usually measured by a loss function l(f, z): F × Z → R_+. It represents the loss of a decoding metric function f, which is generated by A on S, with respect to an example z. F is generally a linear space.
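As a concrete reading of Eqs. (2)-(4), the snippet below scores every class with a learned real-valued decoding matrix W and computes the margins used later by the surrogate losses. The variable names are illustrative, and the kernelized variant is omitted.

```python
import numpy as np

def adaptive_decode(h_out, W):
    """Eq. (3): decoding scores f(x, y') = H(x)^T W_{y'} for all classes.
    h_out: length-l real-valued binary classifier outputs; W: (k, l)."""
    return W @ h_out                              # shape (k,)

def margins(h_out, W, y):
    """Eq. (4): rho_f(x, y') = f(x, y) - f(x, y'); note rho_f(x, y) = 0."""
    scores = adaptive_decode(h_out, W)
    return scores[y] - scores

def predict(h_out, W):
    """G(x): y* = argmax_{y'} f(x, y')."""
    return int(np.argmax(adaptive_decode(h_out, W)))
```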
As an estimate of the generalization error, the empirical error of algorithm A on S is usually measured by the 0-1 loss:

R_emp(A, S) = (1/m) Σ_{i=1}^{m} ɛ(A_S, z_i) = (1/m) Σ_{i=1}^{m} l_{0-1}(f, z_i) = (1/m) Σ_{i=1}^{m} δ(G(x_i) ≠ y_i),   (5)

where δ(π) equals 1 when the predicate π is true and 0 otherwise. From the study of classification and regression problems, we know that minimising the 0-1 loss is computationally expensive due to its combinatorial nature. Thus, with the consideration of a trade-off between the statistical and numerical properties of the learning loss, a surrogate loss is usually adopted as a substitute for the 0-1 loss. A popular surrogate is the hinge loss l_hinge(ρ) = max(0, 1 − ρ), where ρ denotes the margin. Although hinge loss has a nice numerical property, convexity, which leads to efficient optimization, its statistical properties have flaws. Its loss penalty is linearly proportional to the margin. This makes hinge loss prone to outliers. Moreover, hinge loss treats any training example within the margin or misclassified as a support vector (SV). This usually leads to a large number of SVs, which makes both training and testing expensive. It becomes worse when large-scale data sets are involved, since the number of SVs increases linearly with respect to the number of training examples. In order to avert the drawbacks of hinge loss, we propose to extend ramp loss [5] as a surrogate for measuring the empirical multi-class decoding error:

l_ramp^{(s_l, s_r)}(f, z) = min( s_r − s_l, max_{y'} ( b_{y,y'}^{s_r} − ρ_f(x, y') ) )
                          = l_hinge^{s_r}(f, z) − l_hinge^{s_l}(f, z)
                          = max_{y'} ( b_{y,y'}^{s_r} − ρ_f(x, y') ) − max_{y'} ( b_{y,y'}^{s_l} − ρ_f(x, y') ),   (6)

where b_{y,y'}^{s_r} equals 0 if y' = y and s_r otherwise, and similarly b_{y,y'}^{s_l} equals 0 if y' = y and s_l otherwise. s_l and s_r represent the left and right hinge points of the ramp loss respectively. Its piecewise form can be written as:

l_ramp^{(s_l, s_r)}(f, z) =
    B,                                if ρ_f(x, y') ≤ s_l,
    b_{y,y'}^{s_r} − ρ_f(x, y'),      if s_l ≤ ρ_f(x, y') ≤ s_r,   (7)
    0,                                if ρ_f(x, y') ≥ s_r,

where B = s_r − s_l is the maximum output value of l_ramp^{(s_l, s_r)}.

With the empirical error measured by the extended ramp loss, the proposed algorithm A can be defined as a regularized empirical error minimizer:

A_S = arg min_{f ∈ F} (1/m) Σ_{i=1}^{m} l_ramp^{(s_l, s_r)}(f, z_i) + (λ/2) N(f)   (8)

    = arg min_{W ∈ R^{k×l}} [ (λ/2) ‖W‖_F² + (1/m) Σ_{i=1}^{m} l_hinge^{s_r}(f, z_i) ] − [ (1/m) Σ_{i=1}^{m} l_hinge^{s_l}(f, z_i) ],   (9)

where N(f) is a regularizer that measures the complexity of f and ‖·‖_F denotes the Frobenius norm. It is noted that the learning objective can be decomposed into the sum of a convex component J_convex^S(W) (the first bracketed term) and a concave component J_concave^S(W) (the negated second bracketed term). Thus, the optimization problem defined by Eq. 9 can be solved by an iterative learning procedure, the Concave-Convex Procedure (CCCP) [21], which is closely related to Difference of Convex (DC) programming [18]. At each learning iteration, a convex upper bound is found for the learning objective by replacing the concave component with its first-order Taylor expansion at the current state. The learning problem then reduces to a convex problem to which existing kernelization and optimization techniques can be applied. The learning procedure stops when convergence is reached.
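To make the surrogate and its optimization concrete, the sketch below evaluates the multi-class ramp loss of Eq. (6) as a difference of two hinge terms and runs an illustrative CCCP loop for Eq. (9): the concave part is linearised at the current iterate and plain subgradient steps are taken on the resulting convex upper bound. The default hinge points (s_l = -1, s_r = 1), the learning rate and the subgradient inner solver are our assumptions; the paper's solver is kernelized and can use any standard convex optimizer in the inner step.

```python
import numpy as np

def hinge_term(W, h_out, y, s):
    """max_{y'} (b^s_{y,y'} - rho_f(x, y')) from Eq. (6), plus its argmax."""
    scores = W @ h_out                      # f(x, y') for all classes
    b = np.full(W.shape[0], s)
    b[y] = 0.0                              # b^s_{y,y} = 0
    vals = b - (scores[y] - scores)         # b^s_{y,y'} - rho_f(x, y')
    y_star = int(np.argmax(vals))
    return vals[y_star], y_star

def ramp_loss(W, h_out, y, s_l=-1.0, s_r=1.0):
    """Eq. (6): difference of two hinge terms; bounded above by B = s_r - s_l."""
    return hinge_term(W, h_out, y, s_r)[0] - hinge_term(W, h_out, y, s_l)[0]

def hinge_subgrad(W, h_out, y, s):
    """A subgradient of the hinge term with respect to W."""
    G = np.zeros_like(W)
    val, y_star = hinge_term(W, h_out, y, s)
    if y_star != y and val > 0:
        G[y_star] += h_out                  # raising f(x, y*) increases the loss
        G[y] -= h_out                       # raising f(x, y) decreases it
    return G

def cccp_fit(H, labels, k, lam=0.1, s_l=-1.0, s_r=1.0,
             outer=10, inner=200, lr=0.01):
    """Illustrative CCCP for Eq. (9) on precomputed classifier outputs H (m x l)."""
    m, l = H.shape
    W = np.zeros((k, l))
    for _ in range(outer):
        # Linearise the concave component J_concave at the current W.
        G_cave = sum(hinge_subgrad(W, H[i], labels[i], s_l) for i in range(m)) / m
        for _ in range(inner):
            # Subgradient of the convex upper bound:
            # regulariser + hinge^{s_r} term - <grad of concave part, W>.
            G = lam * W - G_cave
            G += sum(hinge_subgrad(W, H[i], labels[i], s_r) for i in range(m)) / m
            W -= lr * G
    return W
```

Examples whose margins fall outside the interval (s_l, s_r) contribute a zero subgradient, which is the mechanism by which the ramp loss keeps the number of SVs small.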
3.2. Generalization Error Bound

Following the work in [3], we utilize a concentration inequality (McDiarmid's inequality [14], specifically) and algorithmic stability [3] to give a generalization error bound for the proposed decoding method. We keep using A_S to denote the decoding metric that is learned by the proposed algorithm A from the data set S. The generalization error is a random variable depending on S. It can be written as:

R(A, S) = E_z[ ɛ(A_S, z) ],   (10)

where ɛ(A_S, z) denotes the loss of a decoding metric f with respect to an example z = (x, y). However, the expectation E_z cannot be calculated because the underlying distribution of z is unknown. Thus, the empirical error on the available data set S is usually utilized as an estimator of the generalization error:

R_emp(A, S) = (1/m) Σ_{i=1}^{m} ɛ(A_S, z_i).   (11)

For the proposed method, the multi-class ramp loss l_ramp^{(s_l, s_r)}, as defined in Eq. 7, is used to measure ɛ(A_S, z). The goal of our analysis here is to give a bound for the difference between R(A, S) and R_emp(A, S); more precisely, a bound for P_S[ |R_emp(A, S) − R(A, S)| > η ], where η > 0. In the following, we briefly review the analysis in [3] for the generalization error bound given uniform stability. Then uniform stability is shown to hold for the proposed decoding algorithm.

Theorem 1 (McDiarmid's Inequality [14]). Given a set of examples S = {z_i = (x_i, y_i)}_{i=1}^{m}, a modified set S^i defined as S^i = {z_1, ..., z_{i−1}, z'_i, z_{i+1}, ..., z_m}, and a measurable function F: Z^m → R, if the following condition is satisfied:

sup_{S ∈ Z^m, z'_i ∈ Z} |F(S) − F(S^i)| ≤ c_i,

then

P_S[ F(S) − E_S[F(S)] ≥ η ] ≤ exp( −2η² / Σ_{i=1}^{m} c_i² ).

Definition 1 (Uniform Stability [3]). An algorithm A has uniform stability β if the following condition holds:

∀S ∈ Z^m, ∀i ∈ {1, ..., m},  sup_z |ɛ(A_S, z) − ɛ(A_{S\i}, z)| ≤ β,

where S\i denotes the data set which excludes the i-th element from S.

With Theorem 1 and the uniform stability defined above, the following theorem about the generalization error bound has been given in [3].

Theorem 2 (Exponential Bound with Uniform Stability [3]). Given an algorithm A having uniform stability β with respect to a loss function l which satisfies 0 ≤ l(A_S, z) ≤ B for all sets S, then for any m ≥ 1 and any δ ∈ (0, 1), the following bound holds with probability at least 1 − δ over the random draw of the sample S:

R ≤ R_emp + 2β + (4mβ + B) √( ln(1/δ) / (2m) ).

Now we derive the following theorem about the uniform stability of the ramp loss.

Theorem 3. Given (1) a reproducing kernel Hilbert space E with kernel K: X × X → R satisfying ∀x ∈ X, K(x, x) ≤ κ² < ∞, (2) a loss function l(f, z): F × Z → [0, ∞) which is monotonic and 1/s_r-Lipschitz in its first argument, and (3) a symmetric algorithm A generating a solution f_S ∈ F on training set S by optimizing

f_S = arg min_{f ∈ F} R_emp^l(A, S) + (λ/2) ‖f‖_K²,

where R_emp^l = (1/m) Σ_{i=1}^{m} l(f, z_i) and λ > 0, then A has uniform stability

β(m) = 8κ² / (s_r² λ m).

The proof of Theorem 3 is deferred to the supplementary material. The direct application of Theorems 2 and 3 provides the following generalization error bound for the proposed decoding algorithm.

Example 1. Given the conditions of Theorem 3, the following generalization error bound holds for algorithm A:

R(A, S) ≤ R_emp^ramp(A, S) + 16κ² / (s_r² λ m) + ( 32κ² / (s_r² λ) + B ) √( ln(1/δ) / (2m) ).

Remark 1. The above gives a tight bound for the algorithm defined by Eq. 2, since β scales as 1/m. From this bound, we can see that a larger s_r leads to a faster learning rate. Similar conclusions for structured estimation and clipped regularized risk minimizers are obtained in [4]. However, a larger s_r also leads to a denser classifier which has more SVs [5]. For B = s_r − s_l and s_l < s_r, we can see that a larger s_l leads to a smaller B, which indicates a better bound. It is possible to set s_l in [0, s_r). This gives a smaller B and a shorter SV interval. However, misclassified examples will then never become SVs, and this will degrade the generalization performance. The later empirical study verifies this. Note that SVM with ramp loss does not have the above bound because of the bias item in its prediction function. There is no explicit B for hinge loss to have the above bound.

4. Extension for Multi-class Hypothesis Transfer Learning

Given target domain examples S = {z_i}_{i=1}^{m} and a collection of auxiliary source domain hypotheses ΔH(x) = [h_{l+1}, ..., h_{l+n_sh}], where n_sh denotes the number of source hypotheses, the aim of multi-class HTL is to learn a set of binary hypotheses H(x) = [h_1(x), ..., h_l(x)] in the target domains and an adaptive decoding metric function f(x, y') = H̃(x)^T W_{y'}, where H̃(x) = [h_1(x), ..., h_{l+n_sh}(x)] concatenates the target and source hypotheses, so that the classification performance for the target domains is improved by transferring the knowledge in ΔH(x). It is noted that the decoding metric function f(x, y') is now parameterized by both W and H(x). If we assume H(x) is a set of binary hypotheses of the form h_j(x) = w_j · x + b_j, where j ∈ [1, l], then the extended multi-class HTL algorithm becomes:

A_{S,ΔH} = arg min_{f ∈ F} (1/m) Σ_{i=1}^{m} l_ramp^{(s_l, s_r)}(f, z_i) + (λ/2) N(f)
         = arg min_{W ∈ R^{k×(l+n_sh)}, H} (1/m) Σ_{i=1}^{m} l_ramp^{(s_l, s_r)}(f, z_i) + (λ/2) ( ‖W‖_F² + Σ_{j=1}^{l} ‖w_j‖² ).   (12)

The globally optimal solution is difficult, if not impossible, to obtain, since both W and H(x) need to be optimized. So we propose to adopt an alternating optimization strategy: (i) for fixed H(x), optimize W; (ii) for fixed W, optimize H(x). The two steps alternate until convergence or the maximum number of iterations is reached. Although there is no guarantee that the global optimum will be achieved, each optimization step decreases the overall objective function. Step (i) can be readily solved by the proposed learning-based decoding method. For step (ii), simultaneously optimizing all the binary hypotheses is difficult. Inspired by the random coordinate descent approach, we propose to optimize the binary hypotheses cyclically in a random order until convergence or the maximum number of iterations is reached. Any existing coding scheme can be used to initialize the algorithm; a sketch of the alternation is given below.
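The following is a structural illustration of the alternation only: the ridge-regression stand-in for the binary SVM training and the sign-based refit in step (ii) are simplifications we introduce to keep the example self-contained, not the paper's actual inner solvers (which train SVMs and optimize each hypothesis against the ramp-loss objective).

```python
import numpy as np

def fit_linear_binary(X, targets, reg=1e-3):
    """Stand-in binary learner (regularised least squares to {-1,+1} targets);
    the paper trains SVMs here."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])            # append a bias column
    w = np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ targets)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ w

def multiclass_htl(X, y, k, M_init, source_hyps, decode_fit, rounds=5):
    """Alternating optimisation sketch for Eq. (12).
    X: (m, d) target examples; y: (m,) integer labels; M_init: (k, l) initial coding;
    source_hyps: list of frozen source hypotheses; decode_fit: step (i) solver."""
    l = M_init.shape[1]
    # Initialise the target hypotheses from the initial coding matrix.
    hyps = [fit_linear_binary(X, M_init[y, j]) for j in range(l)]
    W = None
    for _ in range(rounds):
        # Step (i): fix the hypotheses, learn the decoding matrix W over the
        # concatenation of target and source hypothesis outputs.
        H = np.column_stack([h(X) for h in hyps] + [h(X) for h in source_hyps])
        W = decode_fit(H, y, k)
        # Step (ii): fix W, refit the target hypotheses cyclically in random
        # order; here each is refit to the signs of its refined codes.
        for j in np.random.permutation(l):
            hyps[j] = fit_linear_binary(X, np.sign(W[y, j] + 1e-12))
    return hyps, W
```

Here decode_fit can be any learner for step (i), for instance the cccp_fit sketch given earlier.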
Target domain classifiers are first learned based on the initial coding. Learning-based decoding is then conducted with the support of the auxiliary source hypotheses. Subsequently, the target domain classifiers are updated. The alternating optimization continues until convergence or the maximum number of iterations is reached.

5. Experiments

We perform three experiments to evaluate the performance of the proposed method. The first experiment compares our learning-based decoding method with state-of-the-art decoding approaches. The second experiment evaluates the impact of ramp loss under different parameter settings, verifying our theoretical analysis. The last experiment compares our multi-class HTL method with state-of-the-art HTL algorithms. Unless otherwise mentioned, the following methodology is adopted for the experiments. The model parameters are selected via cross-validation for the best generalization performance. For the compared state-of-the-art techniques, the default and optimized parameters given by the authors are used. 10-fold cross-validation is applied to obtain the average accuracy and standard deviation for performance measurement.

Method        | Glass        | Vowel        | Balance      | Satimage     | Letter       | Vehicle
HD (SVM)      | 55.75 ± 3.60 | 65.73 ± 2.62 | 85.57 ± 4.8  | 83.36 ± 2.02 | 85. ± 0.97   | 76.2 ± 2.40
HD (Boost)    | 66.69 ± 3.6  | 6.30 ± 2.67  | 78.97 ± 5.02 | 83.25 ± 2.23 | 88.3 ± .58   | 72.34 ± 3.34
PD (SVM)      | 57.04 ± 3.59 | 66.37 ± 2.69 | 83.36 ± 4.09 | 80.90 ± .74  | 77.82 ± .0   | 76.00 ± .45
PD (Boost)    | 64.35 ± 2.72 | 58.48 ± 3.02 | 82.22 ± 4.9  | 83.90 ± .82  | 87.65 ± 2.0  | 72.58 ± 2.95
LAP (SVM)     | 57.73 ± 3.4  | 68.40 ± 2.94 | 85.57 ± 4.8  | 83.36 ± 2.02 | 88.89 ± .05  | 76.2 ± 2.40
LAP (Boost)   | 66.69 ± 3.6  | 65.36 ± 2.7  | 80.5 ± 4.0   | 84.5 ± 2.22  | 90.2 ± .8    | 72.34 ± 3.34
ELW (SVM)     | 59.30 ± 3.6  | 7.44 ± 3.6   | 85.57 ± 4.8  | 84.07 ± 2.00 | 89.44 ± 0.97 | 76.95 ± .76
ELW (Boost)   | 66.69 ± 3.6  | 7.77 ± 3.02  | 77.55 ± 4.46 | 85.37 ± .87  | 9.92 ± .58   | 73.5 ± 3.6
Prpsd (SVM)   | 63.06 ± 2.67 | 74.59 ± 2.39 | 88.76 ± .56  | 90.22 ± .02  | 95.32 ± .20  | 79.87 ± .49
Prpsd (Boost) | 68.89 ± 2.7  | 74.33 ± 2.78 | 83.09 ± 2.78 | 9.08 ± 2.27  | 96.77 ± .26  | 76.29 ± 2.80

Table 1. Average performances (in percentage) of the state-of-the-art and the proposed decoding methods on six multi-class data sets from UCI. The best results are shown in bold.

5.1. Comparison with Previous Decoding Methods

We first compare the state-of-the-art decoding methods with the proposed learning-based decoding algorithm. As shown in Table 1, six multi-class datasets from the UCI Machine Learning Repository [2] are used for the comparison. The compared decoding methods include Hamming decoding (HD), probabilistic-based decoding (PD) [8], Laplacian decoding (LAP) [8] and exponential loss-weighted decoding (ELW) [8] with continuous classifier outputs. Similar to the settings in [8], SVM and AdaBoost (40 runs of decision stumps) are used as binary classifiers. An RBF kernel is used for the proposed decoding method. Seven popular coding schemes are considered, including one-versus-one, one-versus-all, dense random, sparse random, DECOC [17] and ECOC-ONE [16]. For each decoding method, only the best coding result is reported. The comparison results are shown in Table 1. We can see that the proposed decoding method outperforms the other decoding methods. Large improvements are gained for the datasets Letter and Satimage, which have a larger number of training examples compared with the others.

5.2. The Impact of Ramp Loss Settings

The second experiment evaluates the impact of ramp loss under different parameter settings. The data used for this experiment is the 15-scene image recognition data set [12]. There are 4485 images in total. 100 images are selected from each category for training. The remaining 2985 images are used for testing. 10-fold cross validation is applied. For visual description, we use spatial Principal component Analysis of Census Transform histograms (sPACT) [20, 9, 22] as the feature. The result is shown in Fig. 1. The left plot shows the change in the number of SVs as a function of the number of training examples. It can be seen that, for hinge loss, namely Ramp(−∞, 1), the number of SVs increases linearly with the number of training examples. It becomes sublinear when ramp loss is used. We can observe that better scalability (fewer SVs) is achieved when s_l increases. A larger s_l leads to a shorter SV interval. Subsequently, the number of SVs is dramatically reduced. This indicates that ramp loss gives us the ability to control the scalability of the algorithm by controlling the length of the SV interval.

(Figure 1. Impact of s_l on learning scalability and generalization performance.)

The right plot in Fig. 1 shows the impact of s_l on both the generalization performance and the scalability.
Two trends can be observed: (1) At the early stage, the number of SVs increases as s_l decreases, and the generalization error is reduced correspondingly. However, the generalization error starts to increase when s_l is decreased further after it reaches 0.9. (2) The number of SVs decreases rapidly when s_l increases. But when s_l becomes larger than 0.9, the generalization performance also starts to drop rapidly. These trends verify our conjecture that the second hinge point, s_l, gives us the flexibility to control the balance between generalization performance and scalability. Depending on the specific problem and its training data, an optimal s_l can be obtained in terms of the optimal balance between computational cost and generalization performance gain.

5.3. Multi-class HTL for Object Recognition

The last experiment is conducted on the Caltech-256 dataset to verify the proposed multi-class HTL method. We use an experimental setting similar to the one used in [10]. We choose 10 classes (bonsai, gorilla, horse, motorbikes, mushroom, segway, skateboard, skunk, snowmobile, sunflower) as target classes. At most 30 examples are selected from each class for training, and 50 examples are used for testing. 20 classes (palm-tree, cactus, fern, hibiscus, bat, bear, leopards, zebra, dolphin, killer-whale,
mountain-bike, fire-truck, car-side, bulldozer, camel, dog, raccoon, chimp, school-bus and touring-bike) are used as source classes. The PHOG shape descriptor, SIFT appearance descriptor, Region Covariance and Local Binary Patterns are used as visual features for representing the examples. The one-versus-all strategy and SVM with RBF kernel are adopted for generating the source domain hypotheses. For the proposed method, the one-versus-all strategy is used for initializing the target domain training. SVM with RBF kernel is used for training the target domain classifiers. The balancing parameter λ and the RBF kernel parameter γ are obtained by cross-validation. We compare our method with three state-of-the-art HTL approaches: MKTL [10], MultiKT [19] and MULTIpLE [11]. Since MultiKT is a binary transfer learning approach, the one-versus-all strategy is adopted for it to achieve multi-class HTL. Learning without knowledge transfer is also compared as a baseline. It adopts the one-versus-rest strategy and only uses target domain examples for training classifiers. The experiment runs 10 times. Training and testing examples are randomly drawn from each category for each run. Recognition accuracies are averaged over runs and classes. The result is shown in Fig. 2 with the number of training examples per class increasing.

(Figure 2. Recognition accuracy for Caltech-256 (10 target classes and 20 source classes) multi-class object categorization.)

From Fig. 2, we can see that all the HTL approaches perform better than learning without knowledge transfer, and the proposed method consistently outperforms the other HTL approaches. The improvement becomes larger as the number of training examples per class increases. The reason is that, with more target domain training examples available, the proposed method improves not only target domain classifier learning but also coding space learning.

6. Conclusion

A learning-based decoding method is proposed for improving multi-class classification. It exploits ramp loss to measure multi-class decoding error and refines coding with real-valued codes. Both theoretical analysis and numerical results show its superiority over existing approaches. The method is further extended for multi-class HTL. Source domain hypotheses are exploited for leveraging target domain training via an alternating optimization process between coding space learning and target domain training. Experimental results show that it outperforms the state-of-the-art multi-class HTL approaches.

References

[1] E. L. Allwein et al. Reducing multiclass to binary: A unifying approach for margin classifiers. JMLR, 1:113-141, 2000.
[2] A. Asuncion and D. Newman. UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. University of California, Irvine, 2007.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499-526, 2002.
[4] O. Chapelle, C. Do, Q. Le, A. Smola, and C. Teo. Tighter bounds for structured estimation. In NIPS, pages 281-288, 2008.
[5] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML, pages 201-208, 2006.
[6] K. Crammer and Y. Singer. Improved output coding for classification using continuous relaxation. In NIPS, pages 437-443, 2001.
[7] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. JAIR, 2:263-286, 1995.
[8] S. Escalera et al. On the decoding process in ternary error-correcting output codes. TPAMI, 32(1):120-134, 2010.
[9] G. Herman et al. Region-based image categorization with reduced feature set. In MMSP, pages 586-591, 2008.
[10] L. Jie et al. Multiclass transfer learning from unconstrained priors. In ICCV, pages 1863-1870, 2011.
[11] I. Kuzborskij, F. Orabona, and B. Caputo. From N to N+1: Multiclass transfer incremental learning. In CVPR, 2013.
[12] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169-2178, 2006.
[13] L. Li. Multiclass boosting with repartitioning. In ICML, pages 569-576, 2006.
[14] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.
[15] N. J. Nilsson. Learning Machines. 1965.
[16] O. Pujol et al. An incremental node embedding technique for error correcting output codes. PR, 41(2):713-725, 2008.
[17] O. Pujol, P. Radeva, and J. Vitria. Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. TPAMI, 28(6):1007-1012, 2006.
[18] P. D. Tao et al. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1-4):23-46, 2005.
[19] T. Tommasi, F. Orabona, and B. Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In CVPR, pages 3081-3088, 2010.
[20] J. Wu and J. M. Rehg. Where am I: Place instance and category recognition using spatial PACT. In CVPR, pages 1-8, 2008.
[21] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915-936, 2003.
[22] B. Zhang et al. Finding shareable informative patterns and optimal coding matrix for multiclass boosting. In ICCV, pages 56-63, 2009.
[23] B. Zhang et al. Multi-class graph boosting with subgraph sharing for object recognition. In ICPR, pages 1541-1544, 2010.
[24] B. Zhang et al. Multilabel image classification via high-order label correlation driven active learning. IEEE TIP, 23(3), 2014.
[25] Y. Zhang and J. Schneider. Maximum margin output coding. In ICML, 2012.