BDUOL: Double Updating Online Learning on a Fixed Budget


Peilin Zhao and Steven C.H. Hoi
School of Computer Engineering, Nanyang Technological University, Singapore
{zhao0106,chhoi}@ntu.edu.sg

Abstract. Kernel-based online learning often exhibits promising empirical performance for various applications according to previous studies. However, it suffers from a main shortcoming: the number of support vectors is unbounded, which makes it unsuitable for handling large-scale datasets. In this paper, we investigate the problem of budget kernel-based online learning, which aims to constrain the number of support vectors by a predefined budget when learning the kernel-based prediction function in the online learning process. Unlike existing studies, we present a new framework of budget kernel-based online learning based on a recently proposed online learning method called Double Updating Online Learning (DUOL), which has shown state-of-the-art performance compared with other traditional kernel-based online learning algorithms. We analyze the theoretical underpinning of the proposed Budget Double Updating Online Learning (BDUOL) framework, and then propose several BDUOL algorithms by designing different budget maintenance strategies. We evaluate the empirical performance of the proposed BDUOL algorithms by comparing them with several well-known budget kernel-based online learning algorithms; the encouraging results validate the efficacy of the proposed technique.

1 Introduction

The goal of kernel-based online learning is to incrementally learn a nonlinear kernel-based prediction function from a sequence of training instances [1-4]. Although it often yields significantly better performance than linear online learning, the main shortcoming of kernel-based online learning is the potentially unbounded number of support vectors of the kernel-based prediction function, which requires a large amount of memory for storing the support vectors and a high computational cost for making predictions at each iteration, making it unsuitable for large-scale applications. In this paper, we aim to tackle this challenge by studying a framework for kernel-based online learning on a fixed budget, also known as budget online learning for short, in which the number of support vectors of the prediction function is bounded by a predefined budget size.

In the literature, several algorithms have been proposed for budget online learning.

Crammer et al. [5] proposed a heuristic approach for budget online learning by extending the classical kernel-based Perceptron method [6], which was further improved in [7]. The basic idea of these two algorithms is to remove the support vector that has the least impact on the classification performance whenever the budget, i.e., the maximal number of support vectors, is reached. The main shortcoming of these two algorithms is that they are heuristic and lack solid theoretical support (e.g., no mistake/regret bound is given). Forgetron [8] is perhaps the first approach to budget online learning that offers a theoretical bound on the total number of mistakes. In particular, at each iteration, if the online classifier makes a mistake, it conducts a three-step update: (i) it first runs the standard Perceptron [6] to update the prediction function; (ii) it then shrinks the weights of the support vectors by a carefully chosen scaling factor; and (iii) it finally removes the support vector with the least weight. Another similar approach is the Randomized Budget Perceptron (RBP) [9], which randomly removes one of the existing support vectors when the number of support vectors exceeds the predefined budget. In general, RBP achieves a mistake bound and empirical performance similar to Forgetron. Unlike the above strategies, which discard one support vector to maintain the budget, Projectron [10] adopts a projection strategy to bound the number of support vectors. Specifically, at each iteration where a training example is misclassified, it first updates the kernel classifier by applying a standard Perceptron step; it then projects the new classifier onto the space spanned by all the support vectors except the new example received at the current iteration, provided that the difference between the new classifier and its projection is less than a given threshold; otherwise the classifier remains unchanged. Empirical studies show that Projectron usually outperforms Forgetron in classification accuracy but with significantly longer running time. In addition to its high computational cost, another shortcoming of Projectron is that, although the number of support vectors is bounded, the exact number of support vectors achieved by Projectron is unclear in theory.

The above budget online learning approaches were designed based on the Perceptron learning framework [6]. In this paper, we propose a new framework of Budget Double Updating Online Learning (BDUOL) based on the recently proposed Double Updating Online Learning (DUOL) technique [4], which has shown state-of-the-art performance for online learning. The key challenge is to develop an appropriate strategy for maintaining the budget whenever the set of support vectors overflows. In this paper, following the theory of double updating online learning, we analyze the theoretical underpinning of the BDUOL framework and propose a principled approach to developing three different budget maintenance strategies. We also analyze the mistake bounds of the proposed BDUOL algorithms and evaluate their empirical performance extensively.

The rest of the paper is organized as follows. Section 2 first introduces the problem setting and then presents both the theoretical and the algorithmic framework of the proposed budget online learning technique. Section 3 presents several different budget maintenance strategies for BDUOL. Section 4 discusses our empirical studies. Section 5 concludes this work.

2 Double Updating Online Learning on a Fixed Budget

In this section, we first introduce the problem setting for online learning and Double Updating Online Learning (DUOL), and then present the details of the proposed Budget Double Updating Online Learning framework.

2.1 Problem Setting

We consider the problem of online classification on a fixed budget. Our goal is to learn a prediction function $f: \mathbb{R}^d \to \mathbb{R}$ from a sequence of training examples $\{(x_1,y_1),\ldots,(x_T,y_T)\}$, where $x_t \in \mathbb{R}^d$ is a $d$-dimensional instance and $y_t \in \mathcal{Y} = \{-1,+1\}$ is the class label assigned to $x_t$. We use $\mathrm{sign}(f(x))$ to predict the class assignment for any $x$, and $|f(x)|$ to measure the classification confidence. Let $\ell(f(x),y): \mathbb{R}\times\mathcal{Y} \to \mathbb{R}$ be the loss function that penalizes the deviation of the estimate $f(x)$ from the observed label $y$. We refer to the output $f$ of the learning algorithm as a hypothesis and denote the set of all possible hypotheses by $\mathcal{H} = \{f \mid f: \mathbb{R}^d \to \mathbb{R}\}$. In this paper, we consider $\mathcal{H}$ a Reproducing Kernel Hilbert Space (RKHS) endowed with a kernel function $\kappa(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ [11] implementing the inner product $\langle\cdot,\cdot\rangle$, such that: 1) the reproducing property $\langle f,\kappa(x,\cdot)\rangle = f(x)$ holds for all $x \in \mathbb{R}^d$; and 2) $\mathcal{H}$ is the closure of the span of all $\kappa(x,\cdot)$ with $x \in \mathbb{R}^d$, that is, $\kappa(x,\cdot) \in \mathcal{H}$ for every $x$. The inner product $\langle\cdot,\cdot\rangle$ induces a norm on $f \in \mathcal{H}$ in the usual way: $\|f\|_{\mathcal{H}} := \langle f,f\rangle^{1/2}$. To make the kernel explicit, we use $\mathcal{H}_\kappa$ to denote an RKHS with explicit dependence on the kernel function $\kappa$. Throughout the analysis, we assume $\kappa(x,x) \le 1$ for any $x \in \mathbb{R}^d$.

2.2 Double Updating Online Learning: A Review

Our BDUOL algorithm is designed based on the state-of-the-art Double Updating Online Learning (DUOL) method [4]. Unlike traditional online learning algorithms, which usually perform a single update for each misclassified example, DUOL not only updates the weight of the newly added support vector (SV), but also updates the weight of another existing SV, namely the one that conflicts most with the new SV. Both theoretical and empirical analyses have demonstrated the effectiveness of this approach. Below we briefly review the basics of double updating online learning.

Consider an incoming instance $x_t$ received at the $t$-th step of online learning. The algorithm predicts the class label $\hat{y}_t = \mathrm{sign}(f_{t-1}(x_t))$ using the following kernel-based classifier:
$$f_{t-1}(\cdot) = \sum_{i \in S_{t-1}} \gamma_i y_i \kappa(x_i,\cdot),$$
where $S_{t-1}$ is the index set of the SVs at the $(t-1)$-th step and $\gamma_i$ is the weight of the $i$-th existing support vector. After making the prediction, the algorithm suffers a loss, defined by the hinge loss $\ell(f_{t-1}(x_t),y_t) = \max(0,\, 1 - y_t f_{t-1}(x_t))$. If $\ell(f_{t-1}(x_t),y_t) > 0$, the DUOL algorithm updates the prediction function from $f_{t-1}$ to $f_t$ by adding the training example $(x_t,y_t)$ as a new support vector.
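
To make the prediction rule and loss concrete, the following is a minimal Python sketch of the kernel expansion and hinge loss above; the Gaussian kernel form, its width parameter, and all helper names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def gaussian_kernel(x1, x2, width=8.0):
    # Gaussian kernel kappa(x1, x2) = exp(-||x1 - x2||^2 / (2 * width)); the exact
    # functional form and width used in the paper's experiments are assumptions here.
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * width)))

def predict(support_vectors, x, kernel=gaussian_kernel):
    # f(x) = sum_{i in S} gamma_i * y_i * kappa(x_i, x); sign(f(x)) gives the predicted
    # label and |f(x)| the confidence. support_vectors is a list of (x_i, y_i, gamma_i).
    return sum(gamma * y * kernel(x_i, x) for x_i, y, gamma in support_vectors)

def hinge_loss(f_x, y):
    # l(f(x), y) = max(0, 1 - y * f(x))
    return max(0.0, 1.0 - y * f_x)
```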

Specifically, the newly added example $(x_t, y_t)$ is said to conflict with an existing SV $(x_b, y_b)$, $b \in S_{t-1}$, if the following conditions are satisfied: 1) $l_t = 1 - y_t f_{t-1}(x_t) > 0$; 2) $l_b = 1 - y_b f_{t-1}(x_b) > 0$; and 3) $y_t y_b \kappa(x_t,x_b) \le \min(-\rho,\; y_t y_a \kappa(x_t,x_a))$ for all $a \in S_{t-1}$, $a \ne b$, where $\rho \in [0,1)$ is a threshold. In this case, the updating strategy referred to as double updating is adopted:
$$f_t(\cdot) = f_{t-1}(\cdot) + \gamma_t y_t \kappa(x_t,\cdot) + d_{\gamma_b} y_b \kappa(x_b,\cdot),$$
where $\gamma_t$ and $d_{\gamma_b}$ are computed as follows:
$$(\gamma_t, d_{\gamma_b}) = \begin{cases}
\big(C,\; C-\hat{\gamma}_b\big) & \text{if } k_t C + w_{ab}(C-\hat{\gamma}_b) - l_t < 0 \text{ and } k_b(C-\hat{\gamma}_b) + w_{ab} C - l_b < 0,\\[4pt]
\Big(C,\; \frac{l_b - w_{ab} C}{k_b}\Big) & \text{if } \frac{w_{ab}^2 C - w_{ab} l_b - k_t k_b C + k_b l_t}{k_b} > 0 \text{ and } \frac{l_b - w_{ab} C}{k_b} \in [-\hat{\gamma}_b,\, C-\hat{\gamma}_b],\\[4pt]
\Big(\frac{l_t - w_{ab}(C-\hat{\gamma}_b)}{k_t},\; C-\hat{\gamma}_b\Big) & \text{if } \frac{l_t - w_{ab}(C-\hat{\gamma}_b)}{k_t} \in [0,C] \text{ and } l_b - k_b(C-\hat{\gamma}_b) - w_{ab}\frac{l_t - w_{ab}(C-\hat{\gamma}_b)}{k_t} > 0,\\[4pt]
\Big(\frac{k_b l_t - w_{ab} l_b}{k_t k_b - w_{ab}^2},\; \frac{k_t l_b - w_{ab} l_t}{k_t k_b - w_{ab}^2}\Big) & \text{if } \frac{k_b l_t - w_{ab} l_b}{k_t k_b - w_{ab}^2} \in [0,C] \text{ and } \frac{k_t l_b - w_{ab} l_t}{k_t k_b - w_{ab}^2} \in [-\hat{\gamma}_b,\, C-\hat{\gamma}_b],
\end{cases} \quad (1)$$
where $\hat{\gamma}_b$ is the current weight of $(x_b,y_b)$, $k_t = \kappa(x_t,x_t)$, $k_b = \kappa(x_b,x_b)$, $w_{ab} = y_t y_b \kappa(x_t,x_b)$, and $C > 0$. When no existing SV conflicts with $(x_t,y_t)$, the single update strategy is adopted instead:
$$f_t(\cdot) = f_{t-1}(\cdot) + \gamma_t y_t \kappa(x_t,\cdot), \quad (2)$$
where $\gamma_t = \min(C,\, l_t/k_t)$. It is not difficult to see that the single update strategy reduces to the Passive-Aggressive updating strategy [3].

2.3 Framework of Budget Double Updating Online Learning

Although DUOL outperforms various traditional single updating algorithms, one major limitation is that it does not bound the number of support vectors, which can lead to high computational and memory costs when applied to large-scale applications. In this paper, we aim to overcome this limitation by proposing a budget double updating online learning framework in which the number of support vectors is bounded by a predefined budget size.

Let us denote by $B$ a predefined budget size for the maximal number of support vectors associated with the prediction function. The key difference of Budget DUOL over regular DUOL is an appropriate budget maintenance step ensuring that the number of support vectors of the classifier $f_t$ is less than the budget $B$ at the beginning of each online updating step. In particular, let $f_{t-1}$ denote the classifier produced by regular DUOL at the $(t-1)$-th step; when the support vector set of $f_{t-1}$ reaches size $B$, BDUOL performs a classifier update towards budget maintenance,
$$f_{t-1} \leftarrow f_{t-1} - \Delta f_{t-1},$$
such that the support vector set of the updated $f_{t-1}$ is smaller than $B$. The details of the proposed Budget DUOL (BDUOL) algorithmic framework are summarized in Algorithm 1.
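
The following Python sketch illustrates the two update rules. The single update is exactly Eq. (2); for the double update only the interior (unconstrained) case of Eq. (1) is computed in closed form, with the boundary cases approximated by clipping, so it is a simplified sketch rather than the paper's full case analysis.

```python
def single_update(l_t, k_t, C):
    # Passive-Aggressive style single update of Eq. (2): gamma_t = min(C, l_t / k_t),
    # where k_t = kappa(x_t, x_t) and l_t is the hinge loss of the new example.
    return min(C, l_t / k_t)

def double_update(l_t, l_b, k_t, k_b, w_ab, gamma_b_hat, C):
    # Sketch of the double update of Eq. (1). Only the interior case (the joint
    # unconstrained maximizer of the dual ascent) is computed; the boundary cases
    # of Eq. (1) are approximated here by clipping, which is a simplification.
    det = k_t * k_b - w_ab ** 2
    gamma_t = (k_b * l_t - w_ab * l_b) / det
    d_gamma_b = (k_t * l_b - w_ab * l_t) / det
    gamma_t = min(max(gamma_t, 0.0), C)                             # gamma_t in [0, C]
    d_gamma_b = min(max(d_gamma_b, -gamma_b_hat), C - gamma_b_hat)  # updated weight in [0, C]
    return gamma_t, d_gamma_b
```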

Algorithm 1 The Budget Double Updating Online Learning Algorithm (BDUOL)

Procedure
 1: Initialize S_0 = {}, f_0 = 0;
 2: for t = 1, 2, ..., T do
 3:   Receive a new instance x_t;
 4:   Predict y^_t = sign(f_{t-1}(x_t));
 5:   Receive its label y_t;
 6:   l_t = max{0, 1 - y_t f_{t-1}(x_t)};
 7:   if l_t > 0 then
 8:     if (|S_{t-1}| == B) then
 9:       f_{t-1} = f_{t-1} - Δf_{t-1};   (Budget Maintenance)
10:     end if
11:     l_t = max{0, 1 - y_t f_{t-1}(x_t)};
12:     if l_t > 0 then
13:       w_min = ∞;
14:       for i in S_{t-1} do
15:         if (f^i_{t-1} <= 1) then
16:           if (y_i y_t κ(x_i, x_t) <= w_min) then
17:             w_min = y_i y_t κ(x_i, x_t);
18:             (x_b, y_b) = (x_i, y_i);
19:           end if
20:         end if
21:       end for
22:       f^t_{t-1} = y_t f_{t-1}(x_t);
23:       S_t = S_{t-1} ∪ {t};
24:       if (w_min <= -ρ) then
25:         Compute γ_t and d_{γ_b} using equation (1);
26:         for i in S_t do
27:           f^i_t = f^i_{t-1} + y_i γ_t y_t κ(x_i, x_t) + y_i d_{γ_b} y_b κ(x_i, x_b);
28:         end for
29:         f_t = f_{t-1} + γ_t y_t κ(x_t, ·) + d_{γ_b} y_b κ(x_b, ·);
30:       else  /* no auxiliary example found */
31:         γ_t = min(C, l_t / κ(x_t, x_t));
32:         for i in S_t do
33:           f^i_t = f^i_{t-1} + y_i γ_t y_t κ(x_i, x_t);
34:         end for
35:         f_t = f_{t-1} + γ_t y_t κ(x_t, ·);
36:       end if
37:     else
38:       f_t = f_{t-1}; S_t = S_{t-1};
39:       for i in S_t do
40:         f^i_t = f^i_{t-1};
41:       end for
42:     end if
43:   end if
44: end for
return f_T, S_T
End

Fig. 1. The algorithm of Budget Double Updating Online Learning (BDUOL).
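
As a complement to the pseudocode, the sketch below outlines the main BDUOL loop in Python. The kernel, the double update rule, and the budget maintenance strategy are passed in as functions; all names and interfaces are illustrative assumptions, and for clarity the sketch recomputes f(x_i) instead of caching the values f^i_t as Algorithm 1 does.

```python
def bduol(stream, kernel, double_update, budget_maintenance, C=1.0, rho=0.0, B=100):
    # High-level sketch of Algorithm 1. `stream` yields (x_t, y_t) pairs; `kernel` is
    # kappa; `double_update(l_t, l_b, k_t, k_b, w_ab, gamma_b, C)` returns the pair
    # (gamma_t, d_gamma_b) of Eq. (1); `budget_maintenance(svs, kernel, C)` implements
    # one of the Section 3 strategies and returns an SV list of size < B.
    svs = []  # each SV is a mutable [x_i, y_i, gamma_i]

    def f(x):  # f(x) = sum_i gamma_i * y_i * kappa(x_i, x)
        return sum(g * y * kernel(xi, x) for xi, y, g in svs)

    for x_t, y_t in stream:
        if max(0.0, 1.0 - y_t * f(x_t)) <= 0.0:
            continue  # no loss suffered, keep the classifier unchanged
        if len(svs) == B:
            svs = budget_maintenance(svs, kernel, C)  # shrink the SV set below the budget
        l_t = max(0.0, 1.0 - y_t * f(x_t))            # loss w.r.t. the maintained classifier
        if l_t <= 0.0:
            continue
        k_t = kernel(x_t, x_t)
        # search for the auxiliary SV that conflicts most with the new example
        b, w_min = None, float("inf")
        for i, (x_i, y_i, _) in enumerate(svs):
            if y_i * f(x_i) <= 1.0:                   # auxiliary candidates still suffer loss
                w = y_i * y_t * kernel(x_i, x_t)
                if w <= w_min:
                    b, w_min = i, w
        if b is not None and w_min <= -rho:           # double update
            x_b, y_b, g_b = svs[b]
            l_b = 1.0 - y_b * f(x_b)
            gamma_t, d_gb = double_update(l_t, l_b, k_t, kernel(x_b, x_b), w_min, g_b, C)
            svs[b][2] += d_gb
            svs.append([x_t, y_t, gamma_t])
        else:                                          # single (PA-style) update
            svs.append([x_t, y_t, min(C, l_t / k_t)])
    return svs
```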

The key challenge of BDUOL is to choose an appropriate reduction term $\Delta f_{t-1}$ by a proper budget maintenance strategy, which can not only meet the budget requirement but also minimize the impact of the reduction on the prediction performance. Unlike some existing heuristic budget maintenance approaches, in this paper we propose a principled approach for developing several different budget maintenance strategies. Before presenting the detailed strategies, in the following we analyze the theoretical underpinning of the proposed budget double updating online learning scheme, which is the theoretical foundation for the budget maintenance strategies proposed in Section 3.

2.4 Theoretical Analysis

In this section, we analyze the mistake bound of the proposed BDUOL algorithm. To simplify the analysis, the primal-dual framework is used to derive the mistake bound, following the strategy of the DUOL algorithm. Through this framework, we will show that the gap between the mistake bounds of DUOL and BDUOL is bounded by the cumulative dual ascent induced by the function reduction, i.e., $\Delta f = \sum_i \Delta\gamma_i y_i \kappa(x_i,\cdot)$. To facilitate the analysis, we first introduce the following lemma, which provides the dual objective function of the SVM.

Lemma 1. The dual objective of $P_t(f) = \frac{1}{2}\|f\|^2_{\mathcal{H}_\kappa} + C\sum_{i=1}^t \ell(f(x_i),y_i)$, $C > 0$, is
$$D_t(\gamma_1,\ldots,\gamma_t) = \sum_{i=1}^t \gamma_i - \frac{1}{2}\Big\|\sum_{i=1}^t \gamma_i y_i \kappa(x_i,\cdot)\Big\|^2_{\mathcal{H}_\kappa}, \quad \gamma_i \in [0,C], \quad (3)$$
where the relation between $f$ and $\gamma_i$, $i = 1,\ldots,t$, is $f(\cdot) = \sum_{i=1}^t \gamma_i y_i \kappa(x_i,\cdot)$.

According to the above lemma, the dual ascent resulting from the $t$-th budget maintenance can be computed as follows:

Theorem 1. The dual ascent $DA_t = D_t(\gamma_1-\Delta\gamma_1,\ldots,\gamma_t-\Delta\gamma_t) - D_t(\gamma_1,\ldots,\gamma_t)$ for the $t$-th budget maintenance, i.e., $f_t \leftarrow f_t - \Delta f_t$ with $\Delta f_t = \sum_{i=1}^t \Delta\gamma_i y_i \kappa(x_i,\cdot)$, is given by
$$DA_t = -\sum_{i=1}^t \Delta\gamma_i + \sum_{i=1}^t \Delta\gamma_i y_i f_t(x_i) - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}. \quad (4)$$

Proof.
$$D_t(\gamma_1-\Delta\gamma_1,\ldots,\gamma_t-\Delta\gamma_t) - D_t(\gamma_1,\ldots,\gamma_t)$$
$$= \sum_{i=1}^t (\gamma_i - \Delta\gamma_i) - \frac{1}{2}\Big\|\sum_{i=1}^t (\gamma_i-\Delta\gamma_i) y_i \kappa(x_i,\cdot)\Big\|^2_{\mathcal{H}_\kappa} - \Big[\sum_{i=1}^t \gamma_i - \frac{1}{2}\Big\|\sum_{i=1}^t \gamma_i y_i \kappa(x_i,\cdot)\Big\|^2_{\mathcal{H}_\kappa}\Big]$$
$$= -\sum_{i=1}^t \Delta\gamma_i - \frac{1}{2}\big[\|f_t - \Delta f_t\|^2_{\mathcal{H}_\kappa} - \|f_t\|^2_{\mathcal{H}_\kappa}\big]$$
$$= -\sum_{i=1}^t \Delta\gamma_i - \frac{1}{2}\big[\|f_t\|^2_{\mathcal{H}_\kappa} - 2\langle f_t, \Delta f_t\rangle + \|\Delta f_t\|^2_{\mathcal{H}_\kappa} - \|f_t\|^2_{\mathcal{H}_\kappa}\big]$$
$$= -\sum_{i=1}^t \Delta\gamma_i + \langle f_t, \Delta f_t\rangle - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}$$
$$= -\sum_{i=1}^t \Delta\gamma_i + \Big\langle f_t, \sum_{i=1}^t \Delta\gamma_i y_i \kappa(x_i,\cdot)\Big\rangle - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}$$
$$= -\sum_{i=1}^t \Delta\gamma_i + \sum_{i=1}^t \Delta\gamma_i y_i f_t(x_i) - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}.$$
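
For illustration, the dual ascent of Eq. (4) can be evaluated directly from the reduction coefficients, as in the following sketch (argument names are assumptions, not from the paper).

```python
import numpy as np

def dual_ascent(delta_gamma, y, f_vals, K):
    # Dual ascent of Theorem 1 for a reduction Delta f_t = sum_i dg_i * y_i * kappa(x_i, .):
    #   DA_t = -sum_i dg_i + sum_i dg_i * y_i * f_t(x_i) - 0.5 * ||Delta f_t||^2_{H_kappa}.
    # delta_gamma, y, f_vals are arrays holding dg_i, y_i and f_t(x_i) for the current SVs;
    # K is the corresponding kernel matrix.
    dg = np.asarray(delta_gamma, dtype=float)
    y = np.asarray(y, dtype=float)
    f_vals = np.asarray(f_vals, dtype=float)
    v = dg * y                               # coefficients of Delta f_t in the kernel expansion
    norm_sq = float(v @ np.asarray(K) @ v)   # ||Delta f_t||^2_{H_kappa}
    return float(-dg.sum() + np.sum(dg * y * f_vals) - 0.5 * norm_sq)
```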

Based on the above dual ascent for budget maintenance, we can now analyze the mistake bound of the proposed BDUOL algorithm. To ease our discussion, we first introduce the following lemma [4] about the mistake bound of DUOL.

Lemma 2. Let $(x_1,y_1),\ldots,(x_T,y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1,+1\}$ and $\kappa(x_t,x_t) \le 1$ for all $t$, and assume $C \ge 1$. Then, for any function $f$ in $\mathcal{H}_\kappa$, the number of prediction mistakes $M$ made by DUOL on this sequence of examples is bounded by
$$M \le 2\min_{f\in\mathcal{H}_\kappa}\Big\{\frac{1}{2}\|f\|^2_{\mathcal{H}_\kappa} + C\sum_{i=1}^T \ell(f(x_i),y_i)\Big\} - \rho^2 M_d^w(\rho) - \frac{1+\rho}{1-\rho} M_d^s(\rho),$$
where $\rho \in [0,1)$, and $M_d^w(\rho) > 0$ and $M_d^s(\rho) > 0$ denote the numbers of mistakes for which a weak and a strong double update is performed, respectively (with $M_s$ the number of mistakes with a single update, so that $M = M_s + M_d^w(\rho) + M_d^s(\rho)$).

Combining the above lemma with Theorem 1, it is not difficult to derive the following mistake bound for the proposed BDUOL algorithm.

Theorem 2. Let $(x_1,y_1),\ldots,(x_T,y_T)$ be a sequence of examples, where $x_t \in \mathbb{R}^d$, $y_t \in \{-1,+1\}$ and $\kappa(x_t,x_t) \le 1$ for all $t$, and assume $C \ge 1$. Then, for any function $f$ in $\mathcal{H}_\kappa$, the number of prediction mistakes $M$ made by BDUOL on this sequence of examples is bounded by
$$M \le 2\min_{f\in\mathcal{H}_\kappa}\Big\{\frac{1}{2}\|f\|^2_{\mathcal{H}_\kappa} + C\sum_{i=1}^T \ell(f(x_i),y_i)\Big\} - 2\sum_{i=1}^T DA_i - \rho^2 M_d^w(\rho) - \frac{1+\rho}{1-\rho} M_d^s(\rho),$$
where $\rho \in [0,1)$.

Proof. According to the proof of Lemma 2 [4], we have
$$\frac{1}{2}M_s + \frac{1+\rho^2}{2}M_d^w(\rho) + \frac{1}{1-\rho}M_d^s(\rho) \le \min_{f\in\mathcal{H}_\kappa}\Big\{\frac{1}{2}\|f\|^2_{\mathcal{H}_\kappa} + C\sum_{i=1}^T \ell(f(x_i),y_i)\Big\},$$
and $M = M_s + M_d^w(\rho) + M_d^s(\rho)$. Furthermore, by taking the dual ascents of budget maintenance into consideration, we have
$$\sum_{i=1}^T DA_i + \frac{1}{2}M_s + \frac{1+\rho^2}{2}M_d^w(\rho) + \frac{1}{1-\rho}M_d^s(\rho) \le \min_{f\in\mathcal{H}_\kappa}\Big\{\frac{1}{2}\|f\|^2_{\mathcal{H}_\kappa} + C\sum_{i=1}^T \ell(f(x_i),y_i)\Big\}.$$
Rearranging the above inequality concludes the theorem.

According to the above theorem, if there is no budget maintenance step, the mistake bound of BDUOL reduces to the previous mistake bound of the regular DUOL algorithm. The theorem also indicates that, in order to minimize the mistake bound of BDUOL, one should try to maximize the cumulative dual ascent, i.e., $\sum_{i=1}^T DA_i$, when designing a budget maintenance strategy.

3 Budget Maintenance Strategies

In this section, we follow the above theoretical results to develop several different budget maintenance strategies in a principled way. In particular, as revealed by Theorem 2, a key to improving the mistake bound of BDUOL is to maximize the cumulative dual ascent $\sum_{t=1}^T DA_t$ caused by the budget maintenance. To achieve this, we propose to maximize the dual ascent caused by budget maintenance at each online learning step, i.e.,
$$\max_{\Delta\gamma_1,\ldots,\Delta\gamma_t} DA_t = -\sum_{i=1}^t \Delta\gamma_i + \sum_{i=1}^t \Delta\gamma_i y_i f_t(x_i) - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}. \quad (5)$$
Below, we propose three different budget maintenance strategies, and for each strategy we analyze the principled way of achieving the best dual ascent as well as its time complexity and memory cost.

3.1 BDUOL Algorithm by Removal Strategy

The first strategy for budget maintenance is the removal strategy, which discards one of the existing support vectors; this is similar to the strategies used by Forgetron [8] and RBP [9]. Unlike those heuristic removal strategies, the key idea of our removal strategy is to discard the support vector that maximizes the dual ascent, following the preceding analysis. Specifically, assume the $j$-th SV is selected for removal. We then have the following function reduction term:
$$\Delta f_t = \gamma_j y_j \kappa(x_j,\cdot). \quad (6)$$
As a result, the optimal removal solution is to discard the SV that maximizes the following dual ascent term:
$$DA_{t,j} = -\gamma_j\big(1 - y_j f_t(x_j)\big) - \frac{1}{2}\gamma_j^2 \kappa(x_j,x_j). \quad (7)$$
We note that the above removal strategy is similar to the one in [5] when a Gaussian kernel is adopted, for which $\kappa(x_j,x_j) = 1$; our strategy, however, is strongly theoretically motivated.
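
A minimal sketch of this selection rule, assuming the values f_t(x_j) and κ(x_j, x_j) are cached for every SV (as the complexity analysis below relies on); the function and argument names are illustrative.

```python
def best_sv_to_remove(gammas, ys, f_vals, k_diag):
    # Removal strategy of Section 3.1: score each SV j by Eq. (7),
    #   DA_{t,j} = -gamma_j * (1 - y_j * f_t(x_j)) - 0.5 * gamma_j^2 * kappa(x_j, x_j),
    # using cached f_t(x_j) (f_vals) and kappa(x_j, x_j) (k_diag), and return the index
    # of the SV whose removal yields the largest dual ascent.
    best_j, best_da = None, float("-inf")
    for j, (g, y, f, k) in enumerate(zip(gammas, ys, f_vals, k_diag)):
        da = -g * (1.0 - y * f) - 0.5 * g * g * k
        if da > best_da:
            best_j, best_da = j, da
    return best_j, best_da
```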

Complexity Analysis. Since the BDUOL algorithm caches the value $y_j f_t(x_j)$ for every SV, the computational complexity of evaluating $DA_{t,j}$ is $O(1)$ in practice. Thus, this BDUOL algorithm requires $O(B)$ time to compute all the $DA_{t,j}$ values. After removing the SV with the largest $DA_{t,j}$, the complexity of updating all the cached values $y_j f_t(x_j)$ is $O(B)$. Combining the above with the fact that the original DUOL update takes $O(B)$ time per step, we conclude that the overall time complexity of this BDUOL algorithm is also $O(B)$. As for the memory cost, since only the $B$ SVs, their weight parameters, and the cached values $y_j f_t(x_j)$ have to be stored, the space complexity is also $O(B)$.

3.2 BDUOL Algorithm by Projection Strategy

Although the above removal strategy is optimized to find the support vector whose removal maximizes the dual ascent (i.e., minimizes the loss of dual ascent caused by budget maintenance), removing an existing support vector still unavoidably results in some loss of dual ascent. To minimize such loss, we propose a projection strategy for budget maintenance. Specifically, in the projection strategy, the $j$-th SV selected for removal is projected onto the space spanned by the remaining SVs. The objective is to find the function closest to $\gamma_j y_j \kappa(x_j,\cdot)$ in the space spanned by the remaining SVs, or formally:
$$\min_{\beta_i \in [-\gamma_i,\, C-\gamma_i],\; i\ne j} \Big\|\gamma_j y_j \kappa(x_j,\cdot) - \sum_{i\ne j}\beta_i y_i \kappa(x_i,\cdot)\Big\|^2_{\mathcal{H}_\kappa}, \quad (8)$$
which is essentially a Quadratic Programming (QP) problem, so we can exploit existing efficient QP solvers to find the optimal solution. However, solving the above QP problem directly may not be efficient enough for online learning. To further improve the efficiency, we also propose an approximate solution, which first solves the unconstrained optimization problem and then projects the solution onto the feasible region of the constraints. Specifically, setting the gradient of the above objective with respect to $\beta = [\beta_i]$, $i\ne j$, to zero, one obtains the optimal unconstrained solution
$$\beta = \gamma_j y_j K^{-1} k_j \,./\, y, \quad (9)$$
where $K$ is the kernel matrix of the $x_i$, $i\ne j$, $k_j = [\kappa(x_i,x_j)]$, $i\ne j$, $./$ denotes element-wise division, and $y = [y_i]$, $i\ne j$. In the above, inverting $K$ can be realized efficiently by using the Woodbury formula [12]. As a result, we would set
$$\Delta f_t(\cdot) = \gamma_j y_j \kappa(x_j,\cdot) - \sum_{i\ne j}\beta_i y_i \kappa(x_i,\cdot). \quad (10)$$
However, the SV weights of the resultant $f_t - \Delta f_t$ may not fall into the range $[0,C]$. To fix this problem, we project each $\beta_i$ as follows:
$$\beta = \Pi_{[-\gamma,\, C-\gamma]}\big(\gamma_j y_j K^{-1} k_j \,./\, y\big), \quad (11)$$
where $\Pi_{[a,b]}(x) = \max(a,\min(b,x))$ and $\gamma = [\gamma_i]$, $i\ne j$.
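
A minimal sketch of the approximate projection of Eqs. (9)-(11), using a direct linear solve instead of the incrementally maintained inverse kernel matrix; the function and argument names are assumptions for illustration.

```python
import numpy as np

def approximate_projection(j, gammas, ys, K_rest, k_j, C):
    # Approximate projection of Section 3.2, Eqs. (9)-(11): project the removed SV j onto
    # the span of the remaining SVs via the unconstrained solution
    #   beta = gamma_j * y_j * K^{-1} k_j ./ y,
    # then clip each beta_i into [-gamma_i, C - gamma_i] (Eq. (11)). K_rest is the kernel
    # matrix of the remaining SVs (excluding j) and k_j = [kappa(x_i, x_j)]_{i != j}.
    # A direct solve is used for clarity; the paper maintains K^{-1} incrementally via
    # the Woodbury formula.
    gammas = np.asarray(gammas, dtype=float)
    ys = np.asarray(ys, dtype=float)
    gamma_rest = np.delete(gammas, j)
    y_rest = np.delete(ys, j)
    beta = gammas[j] * ys[j] * np.linalg.solve(np.asarray(K_rest, dtype=float),
                                               np.asarray(k_j, dtype=float)) / y_rest
    return np.clip(beta, -gamma_rest, C - gamma_rest)
```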

Now the key problem is to find the best SV for projection among the $B+1$ candidates. After the projection of $\gamma_j y_j \kappa(x_j,\cdot)$, the resultant dual ascent is
$$DA_{t,j} = -\sum_i \Delta\gamma_i + \sum_i \Delta\gamma_i y_i f_t(x_i) - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}, \quad (12)$$
so the best SV is the one that achieves the largest value of $DA_{t,j}$.

Complexity Analysis. The time complexity of this BDUOL variant is dominated by the computation of $K^{-1}$. Computing $K^{-1}$ from scratch costs $O(B^3)$ time; using the Woodbury formula, however, it can be maintained in only $O(B^2)$ time. In addition, we need to compute $B$ projections, one for every candidate SV, so the total time complexity of one updating step is $O(B^3)$. Finally, the memory cost of this BDUOL algorithm is $O(B^2)$, since storing the kernel matrix $K$ and its inverse $K^{-1}$ dominates the memory usage.

3.3 BDUOL Algorithm by Nearest Neighbor Strategy

Although the above projection strategy achieves a better improvement of the dual ascent than the removal strategy, it is much more computationally expensive. To balance the tradeoff between efficiency and effectiveness, we propose an efficient nearest neighbor strategy, which approximates the projection strategy by projecting the removed SV onto its nearest neighbor SV, measured by the distance in the mapped feature space. This strategy is motivated by the fact that the nearest neighbor SV is usually a good representative of the removed SV, and it significantly improves the time efficiency. In particular, since only the nearest neighbor is used for projection, the corresponding solution according to Equation (11) can be expressed as
$$\beta_{N_j} = \Pi_{[-\gamma_{N_j},\, C-\gamma_{N_j}]}\big(\gamma_j y_j \kappa(x_{N_j},x_{N_j})^{-1}\kappa(x_{N_j},x_j)/y_{N_j}\big), \quad (13)$$
where $\kappa(x_{N_j},\cdot)$ is the nearest neighbor of $\kappa(x_j,\cdot)$. As a result, the corresponding reduction term is $\Delta f_t = \gamma_j y_j \kappa(x_j,\cdot) - \beta_{N_j} y_{N_j} \kappa(x_{N_j},\cdot)$. After the projection of $\gamma_j y_j \kappa(x_j,\cdot)$, the resultant dual ascent is
$$DA_{t,j} = -\sum_i \Delta\gamma_i + \sum_i \Delta\gamma_i y_i f_t(x_i) - \frac{1}{2}\|\Delta f_t\|^2_{\mathcal{H}_\kappa}. \quad (14)$$
Thus, the best SV to project is the one with the largest value of $DA_{t,j}$.

Complexity Analysis. All the computation steps except the nearest neighbor search have constant $O(1)$ time complexity. To speed up the nearest neighbor search, we can cache the index of and distance to the nearest neighbor of each SV, making the search $O(1)$. After removing the best SV, we also need to update the caches, which takes at most $O(B)$ time. In summary, the time complexity of the overall strategy is $O(B)$, and the overall memory cost is also $O(B)$.

Remark: To further improve the efficiency of the projection and nearest neighbor strategies, one could first select the SV to discard using the removal strategy, and then use the projection or nearest neighbor strategy only to preserve the information of that SV before removing it, since the search cost of the removal strategy is much lower than that of the other two strategies.
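
A minimal sketch of the nearest-neighbor weight transfer of Eq. (13), assuming the nearest neighbor N_j has already been identified from the cached distances; the function and argument names are assumptions.

```python
def nearest_neighbor_weight(gamma_j, y_j, gamma_n, y_n, k_nn, k_nj, C):
    # Nearest neighbor strategy of Section 3.3, Eq. (13): absorb the removed SV j into its
    # nearest neighbor N_j by adding to the neighbor's weight
    #   beta_{N_j} = Pi_{[-gamma_{N_j}, C - gamma_{N_j}]}(
    #       gamma_j * y_j * kappa(x_Nj, x_Nj)^{-1} * kappa(x_Nj, x_j) / y_Nj).
    # Here k_nn = kappa(x_Nj, x_Nj) and k_nj = kappa(x_Nj, x_j).
    beta = gamma_j * y_j * k_nj / (k_nn * y_n)
    return min(max(beta, -gamma_n), C - gamma_n)
```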

4 Experimental Results

In this section, we evaluate the empirical performance of the proposed Budget Double Updating Online Learning (BDUOL) algorithms by comparing them with state-of-the-art algorithms for budget online learning.

4.1 Algorithms for Comparison

In our experiments, we implement the proposed BDUOL algorithms as follows:
BDUOL_remo: the BDUOL algorithm with the removal strategy for budget maintenance described in Section 3.1;
BDUOL_proj: the BDUOL algorithm with the exact projection strategy, solved by a standard QP solver, described in Section 3.2;
BDUOL_appr: the BDUOL algorithm with the approximate projection strategy described in Section 3.2;
BDUOL_near: the BDUOL algorithm with the nearest neighbor strategy described in Section 3.3.

For comparison, we include the following state-of-the-art algorithms for budget online learning:
RBP: the Randomized Budget Perceptron algorithm [9];
Forgetron: the Forgetron algorithm [8];
Projectron: the Projectron algorithm [10];
Projectron++: the aggressive version of the Projectron algorithm [10].

In addition, we include two non-budget online learning algorithms as yardsticks:
Perceptron: the classical Perceptron algorithm [6];
DUOL: the Double Updating Online Learning algorithm [4].

4.2 Experimental Testbed and Setup

Table 1. Details of the datasets in our experiments, listing the number of instances and the number of features of each dataset: german, MITface, mushrooms, spambase, splice, and w7a.

We test all the algorithms on the six benchmark datasets listed in Table 1. These datasets can be downloaded from the LIBSVM website, the UCI machine learning repository, and the MIT CBCL face datasets.

These datasets were chosen fairly randomly to cover various dataset sizes. To make a fair comparison, all the algorithms in our comparison adopt the same experimental setup. A Gaussian kernel is adopted in our study, with the kernel width set to 8 for all algorithms and datasets. To keep the number of support vectors fixed for the Projectron and Projectron++ algorithms, we simply store the received SVs until the budget overflows, and then project the subsequent ones onto the space spanned by the stored SVs. The penalty parameter C of the DUOL algorithm was selected by 5-fold cross validation on every dataset from the range 2^[-10:10]; for the cross validation, we randomly divide every dataset into two equal subsets, a cross-validation (CV) set and an online learning set, and all the budget algorithms use the same value of C as DUOL. Furthermore, the value of rho for DUOL and its budget variants is set to 0, according to the previous study on the effect of rho. The budget sizes B for the different datasets are set as proper fractions of the support vector size of Perceptron, as shown in Table 3. All the experiments were conducted 20 times, each with a different random permutation of the data points, and all results are reported as averages over the 20 runs. For performance metrics, we evaluate the online classification performance by the online cumulative mistake rate and the running time cost (a sketch of this evaluation protocol is given after Table 2).

4.3 Performance Evaluation of Non-budget Algorithms

Table 2 summarizes the average performance of the two non-budget kernel-based online learning algorithms. First of all, similar to the previous study [4], we find that DUOL significantly outperforms Perceptron on all the datasets according to t-test results, which validates our motivation for choosing DUOL as the base online learning algorithm for budget online learning. Second, we notice that the support vector size of DUOL is in general much larger than that of Perceptron. Finally, the time cost of DUOL is much higher than that of Perceptron, mostly due to the larger number of support vectors. Both the large number of support vectors and the high computational time motivate the need for studying budget DUOL algorithms in this work.

Table 2. Evaluation of the non-budget algorithms on the six datasets, reporting, for both Perceptron and DUOL, the mistake rate (%), the number of support vectors, and the running time (s) on german, MITface, mushrooms, spambase, splice, and w7a.
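
The following sketch illustrates the evaluation protocol referenced above, i.e., averaging the cumulative online mistake rate over random permutations; the learner interface is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def online_mistake_rate(make_learner, X, Y, n_runs=20, seed=0):
    # Sketch of the evaluation protocol of Section 4.2: run a fresh online learner over
    # n_runs random permutations of the data and report the mean and standard deviation
    # of the cumulative mistake rate. make_learner() must return an object with a method
    # update(x, y) that predicts the label of x before updating on (x, y).
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_runs):
        learner = make_learner()
        mistakes = 0
        for idx in rng.permutation(len(X)):
            if learner.update(X[idx], Y[idx]) != Y[idx]:
                mistakes += 1
        rates.append(mistakes / len(X))
    return float(np.mean(rates)), float(np.std(rates))
```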

4.4 Performance Evaluation of Budget Algorithms

Table 3 summarizes the results of the different budget online learning algorithms, from which we can draw several observations. First of all, we observe that RBP and Forgetron achieve very similar performance in most cases. We also find that Projectron++ achieves a lower mistake rate than Projectron for almost all datasets and budget sizes, which is consistent with the results reported in [10]. Moreover, compared with the baseline algorithms RBP and Forgetron, the proposed BDUOL_remo algorithm, which uses a simple removal strategy, achieves a comparable or better mistake rate when the budget size is large, but fails to improve when the budget size is very small; this indicates that a simple removal strategy may not always be effective and that a better budget maintenance strategy is needed.

Second, among all the budget online learning algorithms in the comparison, we find that BDUOL_proj achieves the lowest mistake rate in most cases. These promising results indicate that the projection strategy can effectively reduce the information loss. However, we also notice that the time cost of BDUOL_proj is among the highest, which indicates that it is important to find a more efficient strategy.

Third, comparing the two approximate strategies, we find that BDUOL_appr achieves better mistake rates than BDUOL_near only on the german and mushrooms datasets, while it consumes considerably more time than BDUOL_near; this indicates that BDUOL_near achieves a better trade-off between mistake rate and time complexity than BDUOL_appr. In addition, when the budget is large, BDUOL_near always achieves performance similar to BDUOL_proj while consuming significantly less time, which indicates that the proposed nearest neighbor strategy is a good alternative to the projection strategy.

Finally, Figure 2 and Figure 3 show the detailed online evaluation processes of the budget online learning algorithms. Similar observations from these figures further verify the efficacy of the proposed BDUOL technique.

5 Conclusions

This paper presented a new framework of budget double updating online learning for kernel-based online learning on a fixed budget, which requires that the number of support vectors associated with the prediction function is always bounded by a predefined budget. We theoretically analyzed its performance, revealing that its effectiveness is tightly connected with the dual ascent achieved by the model reduction for budget maintenance. Based on the theoretical analysis, we proposed three budget maintenance strategies: removal, projection, and nearest neighbor. We evaluated the proposed algorithms extensively on benchmark datasets, and the promising empirical results show that the proposed algorithms outperform the state-of-the-art budget online learning algorithms in terms of mistake rates. Future work will exploit different budget maintenance strategies and extend the proposed work to multi-class budgeted online learning.

Table 3. Evaluation of the budgeted algorithms (RBP, Forgetron, Projectron, Projectron++, BDUOL_remo, BDUOL_near, BDUOL_appr, BDUOL_proj) with varied budget sizes, reporting the mistake rate (%) and running time (s) for each algorithm, dataset, and budget size: german and MITface with B = 50, 100, 150; mushrooms with B = 50, 75, 100; spambase with B = 200, 400, 600; splice with B = 100, 200, 300; and w7a with B = 300, 400, 500.

Fig. 2. Evaluation of online mistake rates against the number of samples on three datasets: german (B = 50, 100, 150), MITface (B = 50, 100, 150), and mushrooms (B = 50, 75, 100), comparing RBP, Forgetron, Projectron, Projectron++, BDUOL_remo, BDUOL_near, BDUOL_appr, and BDUOL_proj. The plotted curves are averaged over 20 random permutations.

Acknowledgments

This work was supported by Singapore MOE tier 1 grant (RG33/11) and Microsoft Research grant (M ).

References

1. Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8).
2. Li Cheng, S. V. N. Vishwanathan, Dale Schuurmans, Shaojun Wang, and Terry Caelli. Implicit online learning with kernels. In NIPS.
3. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7.
4. Peilin Zhao, Steven C. H. Hoi, and Rong Jin. Double updating online learning. Journal of Machine Learning Research, 12, 2011.

Fig. 3. Evaluation of online mistake rates against the number of samples on three datasets: spambase (B = 200, 400, 600), splice (B = 100, 200, 300), and w7a (B = 300, 400, 500), comparing RBP, Forgetron, Projectron, Projectron++, BDUOL_remo, BDUOL_near, BDUOL_appr, and BDUOL_proj. The plotted curves are averaged over 20 random permutations.

5. Koby Crammer, Jaz S. Kandola, and Yoram Singer. Online classification on a budget. In NIPS.
6. Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65.
7. Jason Weston and Antoine Bordes. Online (and offline) on an even tighter budget. In AISTATS.
8. Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5).
9. Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3).
10. Francesco Orabona, Joseph Keshet, and Barbara Caputo. Bounded kernel-based online learning. Journal of Machine Learning Research, 10.
11. Vladimir N. Vapnik. Statistical Learning Theory. Wiley.
12. Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems 13 (NIPS), 2000.


Duality in Linear Programming Duality in Linear Programming 4 In the preceding chapter on sensitivity analysis, we saw that the shadow-price interpretation of the optimal simplex multipliers is a very useful concept. First, these shadow

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

The Set Covering Machine

The Set Covering Machine Journal of Machine Learning Research 3 (2002) 723-746 Submitted 12/01; Published 12/02 The Set Covering Machine Mario Marchand School of Information Technology and Engineering University of Ottawa Ottawa,

More information

NOTES ON LINEAR TRANSFORMATIONS

NOTES ON LINEAR TRANSFORMATIONS NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all

More information

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering, Science and Mathematics School of Electronics and Computer

More information

4.1 Introduction - Online Learning Model

4.1 Introduction - Online Learning Model Computational Learning Foundations Fall semester, 2010 Lecture 4: November 7, 2010 Lecturer: Yishay Mansour Scribes: Elad Liebman, Yuval Rochman & Allon Wagner 1 4.1 Introduction - Online Learning Model

More information

Integer Factorization using the Quadratic Sieve

Integer Factorization using the Quadratic Sieve Integer Factorization using the Quadratic Sieve Chad Seibert* Division of Science and Mathematics University of Minnesota, Morris Morris, MN 56567 seib0060@morris.umn.edu March 16, 2011 Abstract We give

More information

Adapting Codes and Embeddings for Polychotomies

Adapting Codes and Embeddings for Polychotomies Adapting Codes and Embeddings for Polychotomies Gunnar Rätsch, Alexander J. Smola RSISE, CSL, Machine Learning Group The Australian National University Canberra, 2 ACT, Australia Gunnar.Raetsch, Alex.Smola

More information

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine. Tutorial. (and Statistical Learning Theory) Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

More information

MONITORING SYSTEMS IN INTENSIVE CARE UNITS USING INCREMENTAL SUPPORT VECTOR MACHINES

MONITORING SYSTEMS IN INTENSIVE CARE UNITS USING INCREMENTAL SUPPORT VECTOR MACHINES MONITORING SYSTEMS IN INTENSIVE CARE UNITS USING INCREMENTAL SUPPORT VECTOR MACHINES Fahmi Ben Rejab, KaoutherNouira and AbdelwahedTrabelsi BESTMOD, Institut Supérieur de Gestion de Tunis, Université de

More information

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

Predicting Flight Delays

Predicting Flight Delays Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

More information

PRIME FACTORS OF CONSECUTIVE INTEGERS

PRIME FACTORS OF CONSECUTIVE INTEGERS PRIME FACTORS OF CONSECUTIVE INTEGERS MARK BAUER AND MICHAEL A. BENNETT Abstract. This note contains a new algorithm for computing a function f(k) introduced by Erdős to measure the minimal gap size in

More information

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,

More information

Early defect identification of semiconductor processes using machine learning

Early defect identification of semiconductor processes using machine learning STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew

More information

GenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1

GenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1 (R) User Manual Environmental Energy Technologies Division Berkeley, CA 94720 http://simulationresearch.lbl.gov Michael Wetter MWetter@lbl.gov February 20, 2009 Notice: This work was supported by the U.S.

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Maximum Margin Clustering

Maximum Margin Clustering Maximum Margin Clustering Linli Xu James Neufeld Bryce Larson Dale Schuurmans University of Waterloo University of Alberta Abstract We propose a new method for clustering based on finding maximum margin

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

An improved on-line algorithm for scheduling on two unrestrictive parallel batch processing machines

An improved on-line algorithm for scheduling on two unrestrictive parallel batch processing machines This is the Pre-Published Version. An improved on-line algorithm for scheduling on two unrestrictive parallel batch processing machines Q.Q. Nong, T.C.E. Cheng, C.T. Ng Department of Mathematics, Ocean

More information

Jubatus: An Open Source Platform for Distributed Online Machine Learning

Jubatus: An Open Source Platform for Distributed Online Machine Learning Jubatus: An Open Source Platform for Distributed Online Machine Learning Shohei Hido Seiya Tokui Preferred Infrastructure Inc. Tokyo, Japan {hido, tokui}@preferred.jp Satoshi Oda NTT Software Innovation

More information

THREE DIMENSIONAL GEOMETRY

THREE DIMENSIONAL GEOMETRY Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information