Training Robust Support Vector Regression via D. C. Program



Journal of Information & Computational Science 7: 12 (2010) 2385-2394
Available at http://www.joics.com

Training Robust Support Vector Regression via D. C. Program

Kuaini Wang, Ping Zhong, Yaohong Zhao
College of Science, China Agricultural University, Beijing 100083, China

Abstract

The classical support vector machines are sensitive to noise and outliers. In this paper, we propose a truncated quadratic insensitive loss function and develop a robust support vector regression which strongly suppresses the impact of noise and outliers while preserving sparseness. Since the truncated quadratic insensitive loss function is non-convex and non-differentiable, we construct a smooth loss function, the combination of two Huber loss functions, as its approximation. The resulting optimization problem can be formulated as a difference of convex functions program, and we establish a Newton-type algorithm to solve it. Numerical experiments on benchmark datasets show that the proposed algorithm has promising performance.

Keywords: Support Vector Machine; Regression; Loss Function; Robustness; D. C. Program

1 Introduction

Support vector machine (SVM) is a useful tool for machine learning, and it has earned success in various areas ranging from pattern recognition and classification to function estimation and time series prediction [1, 2, 3]. In practice, sampling errors, modeling errors and instrument errors may corrupt the training samples with noise and outliers, and the classical SVM yields poor generalization performance in their presence. There are several kinds of methods for constructing robust SVMs. A commonly used approach builds robust models by assigning weights to the errors caused by the training samples [4, 5, 6]. Another approach builds robust models from ramp loss functions [7, 8, 9]. In addition, robust models can be constructed via second order cone programming [10, 11, 12].
(Project supported by the National Nature Science Foundation of China (No. 70601033) and the Innovation Fund for Graduate Students of China Agricultural University (No. KYCX2010105). Corresponding author: Ping Zhong, pingsunshine@yahoo.com.cn. ISSN 1548-7741 / Copyright 2010 Binary Information Press, December 2010.)

As we know, loss functions play an essential role in supervised learning. One of the most important and popular loss functions is the quadratic loss, and many SVMs are constructed from it, such as L2-SVM [1] and least squares SVM (LS-SVM) [13]. In this paper, we introduce a non-convex and non-differentiable loss function based on the quadratic insensitive

loss function and propose a robust support vector regression (SVR). We smooth the proposed loss function by combining two Huber loss functions and formulate the associated non-convex optimization as a difference of convex functions (d.c.) program. The d.c. algorithm (DCA) has been successfully applied to many different non-differentiable, non-convex optimization problems, to which it quite often gives global solutions and has proved more robust and efficient than related standard methods, especially in the large-scale setting [14, 15]. We employ the concave-convex procedure [16] and develop a Newton-type algorithm to solve the robust SVR, which explicitly incorporates noise and outlier suppression and sparseness in the training process. Experimental results on benchmark datasets confirm the effectiveness of the proposed algorithm.

The rest of this paper is organized as follows. Section 2 presents SVR in the primal. In Section 3, we propose the non-convex loss function and the robust model. In Section 4, a Newton-type algorithm is developed for solving the robust SVR. Section 5 presents the experimental results on benchmark datasets. Finally, Section 6 gives the conclusions.

2 Support Vector Regression in the Primal

In this section, we briefly describe L2-SVR in the primal. Consider a regression problem with training samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d is the input sample and y_i is the corresponding target. We can obtain a predictor by solving the following optimization problem:

  min_{w,b,ξ,ξ*}  (1/2)||w||^2 + C Σ_{i=1}^n (ξ_i^2 + ξ_i*^2)            (1)
  s.t.  w^T φ(x_i) + b - y_i ≤ ε + ξ_i,    i = 1, ..., n                 (2)
        y_i - (w^T φ(x_i) + b) ≤ ε + ξ_i*,  i = 1, ..., n                (3)

where φ(·) is a nonlinear map from the input space to the feature space and C is the regularization factor, which balances the tradeoff between the fitting errors and the model complexity.
Program (1)-(3) can be written as an unconstrained optimization in an associated reproducing kernel Hilbert space H:

  min_f  (1/2)||f||_H^2 + C Σ_{i=1}^n l(f(x_i) - y_i)                    (4)

where l(z) = (max(0, |z| - ε))^2 with ε > 0 is the quadratic insensitive loss function. For the sake of simplicity, we can drop the bias b without loss of generalization performance of SVR [17]. According to [17], the optimal function for (4) can be expressed as a linear combination of the training samples in the feature space, f(x) = Σ_{i=1}^n β_i k(x, x_i), where k(·,·) is a kernel function. Then we have

  min_β  L(β) = (1/2) β^T K β + C Σ_{i=1}^n l(z_i)                       (5)

where K is the kernel matrix with K_ij = k(x_i, x_j), K_i is the i-th row of K, and z_i = K_i β - y_i. Eq. (5) is the formulation of SVR in the primal.
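As a concrete illustration, the primal objective (5) with the quadratic ε-insensitive loss is straightforward to evaluate. The following is a minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def quad_eps_loss(z, eps):
    """Quadratic epsilon-insensitive loss: l(z) = (max(0, |z| - eps))^2."""
    return np.maximum(0.0, np.abs(z) - eps) ** 2

def primal_objective(beta, K, y, C, eps):
    """Objective (5): L(beta) = 0.5 beta^T K beta + C sum_i l(K_i beta - y_i)."""
    z = K @ beta - y                      # residuals z_i = K_i beta - y_i
    return 0.5 * beta @ K @ beta + C * np.sum(quad_eps_loss(z, eps))
```

This is only the convex baseline objective; the robust model of Section 3 replaces `quad_eps_loss` with a truncated variant.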

3 Robust Model

Noise and outliers in the training samples tend to cause large residuals. Hence they exert more influence on the optimal solution of (5), which may make the decision hyperplane of SVR deviate from its proper position and thus deteriorate the generalization performance of SVR. We introduce a non-convex loss function to limit their impact. By setting an upper bound, we get the following loss function:

  l_θ(z) = min{ θ^2, (max(0, |z| - ε))^2 }                               (6)

where θ > 0 is a constant. It is easily seen that l_θ(z) can control the residuals caused by noise and outliers. However, l_θ(z) is neither convex nor differentiable, and the resulting optimization problem is difficult to solve. To overcome this dilemma, we first propose a smooth loss function as an approximation of l_θ(z). To do so, we construct two Huber loss functions l_1^u(z) and l_2^u(z):

  l_1^u(z) = { 0                          if |z| ≤ ε
             { (|z| - ε)^2                if ε < |z| ≤ ε + θ              (7)
             { θ[2|z| - (2ε + θ)]         if |z| > ε + θ

  l_2^u(z) = { 0                          if |z| ≤ ε + θ
             { θ(|z| - ε - θ)^2 / h       if ε + θ < |z| ≤ ε + θ + h     (8)
             { θ[2|z| - (2ε + 2θ + h)]    if |z| > ε + θ + h

where h > 0 is the Huber parameter. Combining l_1^u(z) and l_2^u(z), we obtain

  l_{θ,h}^u(z) = l_1^u(z) - l_2^u(z)
    = { 0                                               if |z| ≤ ε
      { (|z| - ε)^2                                     if ε < |z| ≤ ε + θ
      { θ[2|z| - (2ε + θ)] - θ(|z| - ε - θ)^2 / h       if ε + θ < |z| ≤ ε + θ + h    (9)
      { θ^2 + θh                                        if |z| > ε + θ + h

It is easy to verify that l_{θ,h}^u(z) is continuous and differentiable; its shape is shown in Fig. 1. When h → 0, l_{θ,h}^u(z) approaches l_θ(z) defined by (6), so l_{θ,h}^u(z) is a smooth approximation of l_θ(z). Substituting (9) into (5), we propose the robust model:

  min_β  L_{θ,h}(β) = (1/2) β^T K β + C Σ_{i=1}^n l_{θ,h}^u(z_i)         (10)

Note that the objective function of (10) is non-convex. Denote u(β) = (1/2) β^T K β + C Σ_{i=1}^n l_1^u(z_i) and v(β) = C Σ_{i=1}^n l_2^u(z_i). Then optimization problem (10) can be expressed as

  min_β  u(β) - v(β)                                                     (11)
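For reference, the truncated loss (6) and its smooth approximation (9) can be written down directly. A NumPy sketch, vectorized over the residuals z (names are ours):

```python
import numpy as np

def truncated_quad_loss(z, eps, theta):
    """Non-convex truncated loss (6): min{theta^2, (max(0, |z| - eps))^2}."""
    return np.minimum(theta ** 2, np.maximum(0.0, np.abs(z) - eps) ** 2)

def smooth_loss(z, eps, theta, h):
    """Smooth approximation (9): the difference of two Huber losses l1 - l2."""
    a = np.abs(z)
    out = np.zeros_like(a)                                  # |z| <= eps: zero loss
    m1 = (a > eps) & (a <= eps + theta)                     # SV1 region: quadratic
    out[m1] = (a[m1] - eps) ** 2
    m2 = (a > eps + theta) & (a <= eps + theta + h)         # SV2 region: blending piece
    out[m2] = theta * (2 * a[m2] - (2 * eps + theta)) \
              - theta * (a[m2] - eps - theta) ** 2 / h
    m3 = a > eps + theta + h                                # ESV region: saturated
    out[m3] = theta ** 2 + theta * h
    return out
```

One can check continuity at the breakpoints: at |z| = ε + θ both middle pieces give θ^2, and at |z| = ε + θ + h the blending piece gives θ^2 + θh, matching the plateau.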

[Fig. 1: The smooth non-convex loss function l_{θ,h}^u(z), with the regions ESV, SV2, SV1, NSV, SV1, SV2, ESV marked along the z-axis at ±ε, ±(ε+θ), ±(ε+θ+h); the loss saturates at θ^2 + θh.]

Problem (11) is a d.c. program, since u and v are convex functions. In the d.c. programming literature, the DCA [14, 15] was proposed for solving a general d.c. program of the form min{u(x) - v(x) : x ∈ R^n} with u and v proper lower semi-continuous convex functions, which form a larger class than the class of differentiable functions. DCA solves two sequences of convex programs, the primal and the dual, iteratively in succession, such that the solution of the primal initializes the dual and vice versa. Since there are as many DCAs as there are d.c. decompositions, suitable choices of the d.c. decomposition of the objective function and of the initial point are important for computational efficiency. It can be shown that if v is differentiable, then DCA reduces exactly to the concave-convex procedure (CCCP) [16]. The CCCP algorithm is an iterative procedure that solves a sequence of convex programs:

  x^{t+1} ∈ arg min_x { u(x) - x^T ∇v(x^t) }

The resulting algorithm is proved to have global convergence behavior, i.e., for any initialization, the sequence generated by CCCP converges to a stationary point of the d.c. program. In our program (11), since v is differentiable, we can solve it by CCCP. The optimal solution β of (11) can be obtained by iteratively solving the following optimization problem:

  β^{t+1} = arg min_β { u(β) - β^T ∇v(β^t) }                             (12)

where ∇v(β^t) is the derivative of v with respect to β at the t-th iteration:

  ∇v(β^t) = ∂v(β^t)/∂β = C Σ_{i=1}^n (dl_2^u(z_i^t)/dz_i)(∂z_i/∂β) = -C Σ_{i=1}^n η_i^t K_i^T    (13)

where

  η_i^t = { 0                                    if |z_i^t| ≤ ε + θ
          { (2θ/h)[(ε + θ)s_i^t - z_i^t]         if ε + θ < |z_i^t| ≤ ε + θ + h    (14)
          { -2θ s_i^t                            if |z_i^t| > ε + θ + h
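The coefficients η_i^t in (14) are just the negated derivative of l_2^u evaluated at the current residuals, so the linearization step of CCCP is cheap. A NumPy sketch (function name ours):

```python
import numpy as np

def eta(z, eps, theta, h):
    """Coefficients eta_i^t of Eq. (14), computed from residuals z_i^t."""
    a, s = np.abs(z), np.sign(z)
    out = np.zeros_like(z)                          # |z| <= eps + theta: zero
    mid = (a > eps + theta) & (a <= eps + theta + h)
    out[mid] = (2 * theta / h) * ((eps + theta) * s[mid] - z[mid])
    far = a > eps + theta + h
    out[far] = -2 * theta * s[far]                  # saturated (ESV) region
    return out
```

Note the two branches agree at the boundary |z| = ε + θ + h, where both give -2θ sign(z), so η is continuous in z.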

with s_i^t = sign(z_i^t) = 1 if z_i^t ≥ 0 and -1 if z_i^t < 0. In each iteration, we only need to solve the following convex optimization problem:

  min_β  L_{θ,h}(β) = u(β) + C Σ_{i=1}^n η_i^t K_i β                     (15)

4 Newton Algorithm for Robust SVR

Since (15) is a convex optimization problem, we can establish a Newton-type algorithm to solve it. First, we divide the training samples into four groups according to z_i^t = K_i β^t - y_i at the t-th iteration:

(1) The samples with |z_i^t| ≤ ε are regarded as non-support vectors, lying in the NSV region illustrated in Fig. 1; the number of training samples in this region is denoted by |NSV|.

(2) The samples with ε < |z_i^t| ≤ ε + θ + h are regarded as support vectors. We further divide them into two subgroups: the samples with ε < |z_i^t| ≤ ε + θ lie in the SV1 region, and the samples with ε + θ < |z_i^t| ≤ ε + θ + h lie in the SV2 region. We denote the numbers of samples in these two subgroups by |SV1| and |SV2|, respectively.

(3) The samples with |z_i^t| > ε + θ + h are regarded as error support vectors, which lie in the ESV region shown in Fig. 1; the number of samples in this region is denoted by |ESV|.

For convenience of expression, we arrange the four groups of samples in the order SV1, SV2, ESV, NSV. Let I_1 and I_2 be n×n diagonal matrices, where I_1 has its first |SV1| diagonal entries equal to 1 and the rest 0, and I_2 has its first |SV1| entries 0, the following |SV2| entries 1, and 0 for the rest. In order to develop a Newton-type algorithm for (15), we need to calculate the gradient and Hessian of the objective function of (15).
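The four-way split above is a simple partition of the residuals by magnitude; sketched in NumPy (names ours):

```python
import numpy as np

def split_regions(z, eps, theta, h):
    """Partition sample indices by |z_i| into SV1, SV2, ESV, NSV (Section 4)."""
    a = np.abs(z)
    nsv = np.where(a <= eps)[0]                             # non-support vectors
    sv1 = np.where((a > eps) & (a <= eps + theta))[0]       # support vectors, SV1
    sv2 = np.where((a > eps + theta) & (a <= eps + theta + h))[0]  # SV2
    esv = np.where(a > eps + theta + h)[0]                  # error support vectors
    return sv1, sv2, esv, nsv
```

Keeping index arrays rather than physically reordering K and y mirrors the remark after Algorithm NRSVR: rows and columns can be extracted on demand.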
The gradient is

  ∇L_{θ,h}(β) = Kβ + 2CK[ I_1(Kβ - y - εs) + θ I_2 s^t - (θ/h) I_2 (z^t - (ε + θ)s^t) ]    (16)

where y = [y_1, ..., y_n]^T, s = [sign(z_1), ..., sign(z_n)]^T, z^t = [z_1^t, ..., z_n^t]^T, and s^t = [s_1^t, ..., s_n^t]^T, and the Hessian is

  G = K + 2CKI_1K                                                        (17)

Then the solution β^{t+1} of (15) at the t-th CCCP iteration can be updated by

  β^{t+1} = β^t - G^{-1} ∇L_{θ,h}(β^t)
          = 2C(I_n + 2CI_1K)^{-1} [ I_1(y + εs^t) - θ I_2 s^t + (θ/h) I_2 (z^t - (ε + θ)s^t) ]    (18)

where I_n denotes the n×n identity matrix. In Eq. (18), we need to calculate the inverse of I_n + 2CI_1K. Notice that it is a sparse matrix:

  I_n + 2CI_1K =
    [ I_{SV1} + 2CK_{SV1,SV1}   2CK_{SV1,SV2}   2CK_{SV1,ESV}   2CK_{SV1,NSV} ]
    [ 0                          I_{SV2}          0                0             ]
    [ 0                          0                I_{ESV}          0             ]
    [ 0                          0                0                I_{NSV}       ]

Its inverse can be derived as follows:

  (I_n + 2CI_1K)^{-1} =
    [ A    -2CAK_{SV1,SV2}   -2CAK_{SV1,ESV}   -2CAK_{SV1,NSV} ]
    [ 0     I_{SV2}            0                 0              ]         (19)
    [ 0     0                  I_{ESV}           0              ]
    [ 0     0                  0                 I_{NSV}        ]

where A = (I_{SV1} + 2CK_{SV1,SV1})^{-1}. Substituting (19) into (18), we get the optimal solution at the (t+1)-th iteration:

  β^{t+1} = [ β^{t+1}_{SV1} ; β^{t+1}_{SV2} ; 0 ; 0 ],  where
  β^{t+1}_{SV1} = 2C A { y_{SV1} + ε s^t_{SV1} + 2Cθ K_{SV1,SV2} [ s^t_{SV2} + ((ε + θ) s^t_{SV2} - z^t_{SV2}) / h ] }    (20)
  β^{t+1}_{SV2} = -2Cθ [ s^t_{SV2} + ((ε + θ) s^t_{SV2} - z^t_{SV2}) / h ]

Eq. (20) shows that the samples in the ESV region have no influence on the optimal solution, because the corresponding elements of β^{t+1} are fixed at 0. Since noise and outliers usually lie in the ESV region, the robust SVR is much less sensitive to them and thus gains better generalization performance. In addition, the robust SVR also keeps sparseness, since the elements of β^{t+1} in the NSV region are fixed at 0.

Algorithm NRSVR (Newton-type algorithm for robust SVR)

Given the training samples S = {(x_i, y_i)}_{i=1}^n, the kernel matrix K, a small positive constant ρ, and the predefined constants ε, θ, h:

1. Initialization: solve β^0 with a classical SVM toolbox on a small subset of S. Let t = 0 and divide the training samples into four regions according to K_i β^0 - y_i.
2. Rearrange the regions in the order SV1, SV2, ESV, NSV, and adjust K and y correspondingly. Calculate the gradient ∇L_{θ,h}(β^t) and check whether ||∇L_{θ,h}(β^t)|| ≤ ρ. If so, stop; otherwise go to the next step.
3. Compute β^{t+1} according to Eq. (20).
4. Split the training samples into four regions according to K_i β^{t+1} - y_i. Set t = t + 1 and go to step 2.

Notice that in the above procedure we need not actually reorder K and y during the computation in step 2. In fact, we only need to remember the indices of the samples in the different groups; when they are required, we extract the corresponding rows or columns from the original matrices or vectors.
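One full update (20) can be sketched with dense linear algebra, without the index bookkeeping the paper recommends. This is our own illustrative code, not the authors' implementation; a practical version would exploit the block structure, reuse factorizations of A, and handle an empty SV1 set:

```python
import numpy as np

def nrsvr_step(K, y, beta, C, eps, theta, h):
    """One CCCP/Newton update of Eq. (20); returns beta^{t+1}."""
    z = K @ beta - y
    a, s = np.abs(z), np.sign(z)          # note: np.sign(0) = 0, but z = 0 lies in NSV anyway
    sv1 = (a > eps) & (a <= eps + theta)
    sv2 = (a > eps + theta) & (a <= eps + theta + h)
    new = np.zeros_like(beta)             # ESV and NSV components stay fixed at 0
    g2 = s[sv2] + ((eps + theta) * s[sv2] - z[sv2]) / h
    new[sv2] = -2 * C * theta * g2
    A = np.linalg.inv(np.eye(int(sv1.sum())) + 2 * C * K[np.ix_(sv1, sv1)])
    rhs = y[sv1] + eps * s[sv1] + 2 * C * theta * K[np.ix_(sv1, sv2)] @ g2
    new[sv1] = 2 * C * A @ rhs
    return new
```

Iterating `nrsvr_step` until the gradient norm falls below ρ reproduces the NRSVR loop; the only O(|SV1|^3) cost per step is the inverse A, matching the complexity analysis below.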
In practice, we choose the starting point β^0 such that not all z_i^0 = K_i β^0 - y_i satisfy |z_i^0| ≤ ε or |z_i^0| > ε + θ + h, since if every sample falls into the NSV or ESV region, Eq. (20) yields β^t = 0 for all t. The objective function L_{θ,h}(β) of (10) decreases monotonically along the sequence {β^t} generated by NRSVR. In fact, if β^{t+1} is the optimal solution of (15) at the t-th iteration, then

  u(β^{t+1}) + C Σ_{i=1}^n η_i^t K_i β^{t+1} ≤ u(β^t) + C Σ_{i=1}^n η_i^t K_i β^t    (21)

Since v(β) is a convex function, we have

  v(β^{t+1}) - v(β^t) ≥ ∇v(β^t)^T (β^{t+1} - β^t) = C Σ_{i=1}^n η_i^t K_i β^t - C Σ_{i=1}^n η_i^t K_i β^{t+1}    (22)

From (21) and (22), we obtain L_{θ,h}(β^{t+1}) ≤ L_{θ,h}(β^t). In addition, obviously L_{θ,h}(β) ≥ 0. Hence, by the analysis in [16], NRSVR converges.

Next, we discuss the computational complexity of NRSVR. Since the most time-consuming stage is the iteration, we consider the complexity of one iteration. In step 2, computing ∇L_{θ,h}(β) costs O(n(|SV1| + |SV2|)). In step 3, the cost of updating β^t is max{O(|SV1|^3), O(|SV1|(|SV1| + |SV2|))}. Hence the total computational complexity is O(n(|SV1| + |SV2|) + |SV1|^3), which is comparable with that of algorithms based on convex loss functions [1, 13].

5 Numerical Experiments and Analysis

In order to verify the robustness of the proposed algorithm, we compared NRSVR with LS-SVR and L2-SVR on several benchmark datasets. The Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / σ^2) was used in the experiments. There are five parameters: C, σ, ε, θ, and h. LS-SVR needs only the first two, L2-SVR needs the first three, and the last two are introduced by NRSVR. We searched for the optimal parameters (C, σ, ε, θ, h) over the sets {2^{-10}, ..., 2^{10}} × {2^{-10}, ..., 2^{10}} × {10^{-3}, 2×10^{-3}, 5×10^{-3}, 10^{-2}, 2×10^{-2}, ..., 9×10^{-2}, 10^{-1}} × {0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45} × {0.001, 0.005, 0.01, 0.05, 0.1} by five-fold cross validation. We adopted three popular criteria, root mean square error (RMSE), mean absolute error (MAE), and mean relative error (MRE), to evaluate the generalization performance of the three algorithms. All experiments were carried out on an Intel Pentium IV 3.00 GHz PC with 2 GB of RAM, using Matlab 7.0 under Microsoft Windows XP. We tested the three algorithms on a collection of seven benchmark datasets from the UCI [a] and StatLib [b] repositories.
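The experimental building blocks, the Gaussian kernel matrix and the three error criteria, are standard. A NumPy sketch follows; the exact MRE formula (mean of |y_i - ŷ_i| / |y_i|) is our assumption, since the paper does not spell it out:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Kernel matrix K_ij = exp(-||x_i - x_j||^2 / sigma^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-d2 / sigma ** 2)

def rmse(y, yhat):
    """Root mean square error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def mre(y, yhat):
    """Mean relative error (assumed definition; undefined for y_i = 0)."""
    return np.mean(np.abs(y - yhat) / np.abs(y))
```

With these pieces and a grid over (C, σ, ε, θ, h), the five-fold cross-validation search described above can be reproduced directly.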
Pyrim, Triazines, AutoMPG, and Boston Housing are taken from UCI; Pollution, Bodyfat, and Concrete are taken from StatLib. In order to test the robustness of the three algorithms, 20% large noise was added to each dataset. For each dataset, some samples were randomly chosen for training and the rest were used for testing; the specific numbers are listed in the TrNum and TeNum columns of Table 1. We used the same training and test sets for all three algorithms on each dataset. The experimental results are summarized in Table 1. It can be seen that NRSVR achieves the best performance among the three algorithms on all datasets.

Next, we discuss the influence of the parameters θ and h introduced by NRSVR. h is the Huber parameter used to smooth the non-convex loss function, and its value is usually small; in our experience, h = 10^{-3} is appropriate. The parameter θ is introduced to limit the upper bound of the loss function. In general, it should be neither too large nor too small. If θ is too large, noise and outliers are easily treated as support vectors, which not only reduces the prediction accuracy of NRSVR but also aggravates the testing burden, because more support vectors appear in the optimal solution. If θ is too small, some normal samples are treated as outliers in the training phase and do not take part in determining the decision hyperplane. This results in

[a] Available from URL: http://archive.ics.uci.edu/ml/.
[b] Available from URL: http://lib.stat.cmu.edu/datasets/.

Table 1: Experimental results on benchmark datasets

  Dataset          Algorithm   RMSE      MAE       MRE      TrNum   TeNum
  Pollution        LS-SVR      44.2928   34.0039   0.0357   40      20
                   L2-SVR      45.2703   35.0242   0.0368   40      20
                   NRSVR       39.4165   30.2575   0.0320   40      20
  Pyrim            LS-SVR      0.0772    0.0535    0.1191   50      24
                   L2-SVR      0.0805    0.0539    0.1199   50      24
                   NRSVR       0.0757    0.0508    0.1108   50      24
  Triazines        LS-SVR      0.1301    0.0997    0.2285   150     36
                   L2-SVR      0.1287    0.0993    0.2275   150     36
                   NRSVR       0.1274    0.0981    0.2261   150     36
  Bodyfat          LS-SVR      0.0094    0.0075    0.0071   200     52
                   L2-SVR      0.0088    0.0071    0.0067   200     52
                   NRSVR       0.0034    0.0023    0.0022   200     52
  AutoMPG          LS-SVR      3.0728    2.2499    0.0992   300     92
                   L2-SVR      2.9373    2.2137    0.0996   300     92
                   NRSVR       2.7882    2.0508    0.0895   300     92
  Boston Housing   LS-SVR      4.2619    2.9092    0.1429   300     206
                   L2-SVR      4.3830    3.1184    0.1548   300     206
                   NRSVR       3.9520    2.6975    0.1358   300     206
  Concrete         LS-SVR      8.3216    6.4423    0.2399   500     530
                   L2-SVR      6.9837    5.2011    0.1843   500     530
                   NRSVR       6.9114    5.1502    0.1836   500     530

poor generalization performance. Therefore, we need to find a suitable value of θ that suppresses the impact of outliers while at the same time keeping good generalization performance. We took the Pollution and Pyrim datasets as examples to illustrate the influence of these two parameters. When one parameter is analyzed, the rest are fixed. The effects of θ and h on the RMSE values for the two datasets are shown in Figs. 2 and 3, respectively. The results validate the above analysis.

6 Conclusion

In this paper, we propose a non-convex and non-differentiable loss function and develop a robust support vector regression which strongly suppresses the impact of noise and outliers while keeping sparseness. We construct a smooth loss function, the combination of two Huber loss functions, to approximate the non-convex loss function. The resulting optimization problem can be formulated as a d.c. program. We employ the concave-convex procedure and develop a Newton-type algorithm to solve it.
Numerical experiments on the benchmark datasets show the effectiveness of the proposed algorithm. In this paper, we only focus on constructing the robust model based on the truncated quadratic loss function. Further research is required to discuss the general form of non-convex loss

function, in order to establish a general robust model.

[Fig. 2: Influence of θ (left graph) and h (right graph) on the RMSE values for Pollution.]

[Fig. 3: Influence of θ (left graph) and h (right graph) on the RMSE values for Pyrim.]

References

[1] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000
[2] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002
[3] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995
[4] J. Suykens, J. De Brabanter, and L. Lukas, Weighted least squares support vector machines: robustness and sparse approximation, Neurocomputing 48 (2002) 85-105
[5] C. Lin and S. Wang, Fuzzy support vector machines, IEEE Transactions on Neural Networks 13 (2002) 464-471
[6] H. Huang and Y. Liu, Fuzzy support vector machines for pattern recognition and data mining, International Journal of Fuzzy Systems 4 (2002) 3-12
[7] R. Collobert, F. Sinz, J. Weston, and L. Bottou, Trading convexity for scalability, in: Proceedings of the 23rd International Conference on Machine Learning, ACM Press, 2006, pp. 201-208

[8] L. Xu, K. Crammer, and D. Schuurmans, Robust support vector machine training via convex outlier ablation, in: Proceedings of the 21st National Conference on Artificial Intelligence, 2006, pp. 536-546
[9] S. Yang and B. Hu, A stagewise least square loss function for classification, in: Proceedings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 120-131
[10] T. B. Trafalis and R. C. Gilbert, Robust classification and regression using support vector machines, European Journal of Operational Research 173 (2006) 893-909
[11] P. Zhong and M. Fukushima, Second order cone programming formulations for robust multi-class classification, Neural Computation 19 (2007) 258-282
[12] P. Zhong and L. Wang, Support vector regression with input data uncertainty, International Journal of Innovative Computing, Information and Control 4 (2008) 2325-2332
[13] J. A. K. Suykens and J. Vandewalle, Benchmarking least squares support vector machine classifiers, Machine Learning 54 (2004) 5-32
[14] P. D. Tao and L. T. H. An, D.C. optimization algorithms for solving the trust region subproblem, SIAM Journal on Optimization 8 (1998) 476-505
[15] L. T. H. An and P. D. Tao, The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems, Annals of Operations Research 133 (2005) 23-46
[16] A. L. Yuille and A. Rangarajan, The concave-convex procedure, Neural Computation 15 (2003) 915-936
[17] O. Chapelle, Training a support vector machine in the primal, Neural Computation 19 (2007) 1155-1178