Financial market forecasting using a two-step kernel learning method for the support vector regression


Ann Oper Res (2010) 174

Li Wang, Ji Zhu

Published online: 28 May 2008
Springer Science+Business Media, LLC 2008

Abstract In this paper, we propose a two-step kernel learning method based on the support vector regression (SVR) for financial time series forecasting. Given a number of candidate kernels, our method learns a sparse linear combination of these kernels so that the resulting kernel can be used to predict well on future data. The $L_1$-norm regularization approach is used to achieve kernel learning. Since the regularization parameter must be carefully selected, to facilitate parameter tuning, we develop an efficient solution path algorithm that solves the optimal solutions for all possible values of the regularization parameter. Our kernel learning method has been applied to forecast the S&P500 and the NASDAQ market indices and showed promising results.

Keywords Financial market forecasting, Kernel learning, LAR/LASSO, Non-negative garrote, Support vector regression

1 Introduction

Forecasting the financial market is a major challenge in both academia and business. Because of their noisy nature, financial time series are among the most difficult signals to forecast, which naturally leads to the debate on market predictability among academics and market practitioners. The efficient market hypothesis (Fama 1970, 1991) and the random walk hypothesis (Malkiel 1973) are major theories in economics and finance. The hypotheses state that the financial market evolves randomly and no excess returns can be made by predicting and timing the market. According to these hypotheses, it is impossible to consistently outperform the market, and the simple buy-and-hold is the best investment strategy.

L. Wang will join Barclays Global Investors, San Francisco, CA 94105, USA.

L. Wang
Ross School of Business, University of Michigan, Ann Arbor, MI 48109, USA
e-mail: wang@umich.edu

J. Zhu (corresponding author)
Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
e-mail: jzhu@umich.edu

Many economists and professional investors, however, dispute these hypotheses, and they believe that the financial market is predictable to some degree. Lo et al. (2000) showed evidence of predictability in the financial market using technical analysis and computational algorithms. Furthermore, Lo and MacKinlay (2001) went through a series of tests and demonstrated the existence of trends and patterns in the financial market. Hirshleifer (2001) provided a survey of empirical evidence for capital market inefficiency and reviewed the explanations for these findings from the behavioral finance perspective.

Researchers in the machine learning and data mining community have also tried to forecast the financial market, using various learning algorithms. The support vector machine (SVM) is one of the tools that have shown promising results (Cao and Tay 2001, 2003, 2001; Kim 2003; Huang et al. 2005). The SVM is a kernel based learning method. Kernel methods embed the original input space $X$ into a Hilbert space $F$ via kernel functions and look for linear relations in $F$, which correspond to nonlinear relations in the original space $X$.

Kernel functions usually contain tuning parameters. The values of the tuning parameters are crucial for the performance of the kernel methods. If a kernel function is given, the associated parameter values can be tuned by either trial-and-error or heuristic search, e.g., the gradient-based method (Chapelle et al. 2002; Keerthi et al. 2007) and the evolutionary method (Friedrichs and Igel 2005; Nguyen et al. 2007). However, in financial market forecasting, in addition to tuning the parameters of a given kernel function, another relevant issue is how to combine different kernel functions so that traders can make better decisions. For example, traders often observe information from various sources, including technical indicators, economic indices, political news, etc. A kernel function may be derived from each information source. Then a natural question is how to combine these information sources (kernels) to help the trader make better forecast decisions. This problem is referred to as kernel learning or kernel selection in the literature. Traditional kernel methods rely on a single kernel function. Kernel learning, on the other hand, often seeks a linear combination of candidate kernels. The candidate kernels can be obtained from different kernel functions, parameter settings and/or data sources.

In this paper, we propose a two-step kernel learning method based on the support vector regression (SVR) for financial market forecasting. In the first step, we solve a standard SVR problem using all candidate kernels simultaneously; in the second step, we use scaling parameters, one for each candidate kernel, to achieve kernel learning. The $L_1$-norm regularization is used to adjust the scaling parameters. Since the regularization parameter has to be selected carefully, to facilitate parameter tuning, we develop an efficient solution path algorithm, which solves the fitted model for every possible value of the regularization parameter. Then the best parameter value can be identified using a parameter selection criterion or a validation dataset.

Several other kernel learning algorithms have also been proposed recently based on different theories and frameworks. Lanckriet et al. (2004), Bach et al. (2004) and Ong et al. (2005) formulated kernel learning as semidefinite programming (SDP) problems. Fung et al. (2004) and Kim et al. (2006) proposed methods implementing kernel learning by iteratively solving quadratic programming problems. Wang et al. (2006) used an orthogonal least square forward selection procedure to achieve a sparse representation of the kernel. Cristianini et al. (2006) proposed a quantity measuring the alignment between a kernel and the data, and they developed algorithms to adapt the kernel matrix to the learning task. Our method, however, is based on a different framework, which is inspired by the non-negative garrote method (Breiman 1995). The non-negative garrote uses scaling parameters for variable selection in linear models; we use a similar idea for selecting kernels in our two-step kernel learning method.

The rest of the paper is organized as follows: in Sect. 2, we describe our method and the solution path algorithm; in Sect. 3, we compare the performance of different methods based on historical data of the S&P500 and the NASDAQ market indices; and we conclude the paper in Sect. 4.

2 Two-step kernel learning for the support vector regression

2.1 Support vector regression

We first briefly review the SVR and refer interested readers to Vapnik (1995) and Smola and Schoelkopf (2004) for details. Let $\{x_1, x_2, \ldots, x_n\}$ represent the $n$ input vectors, where $x_i \in \mathbb{R}^p$, and $\{y_1, y_2, \ldots, y_n\}$ be the corresponding output, where $y_i \in \mathbb{R}$. Let $\Phi : X \to F$ be a mapping function, which transfers a vector in the original space $X$ into a Hilbert space $F$. The SVR solves the following problem:

$$\min_{\beta_0, \beta, \xi} \; \sum_{i=1}^{n} l(\xi_i) + \frac{\lambda}{2} \|\beta\|_2^2 \tag{1}$$

$$\text{subject to } \xi_i = y_i - \beta_0 - \beta^T \Phi(x_i), \quad i = 1, \ldots, n. \tag{2}$$

In the above setup, $l(\cdot)$ is a loss function and $\lambda \ge 0$ is a regularization parameter, which controls the trade-off between the loss and the complexity of the fitted model. Figure 1 shows some of the popular loss functions that have been used for the SVR: the Laplace, the ε-insensitive, the quadratic and the Huber loss functions. The Laplace loss function is more robust to outliers than the quadratic loss function and it requires fewer parameters than the ε-insensitive and the Huber loss functions. Therefore, in this paper we focus on the Laplace loss function. However, we note that our method can be naturally applied to other loss functions as well after straightforward modifications.

With the Laplace loss function, (1)-(2) can be written as:

$$\min_{\beta_0, \beta, \xi} \; \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) + \frac{\lambda}{2} \|\beta\|_2^2$$

$$\text{subject to } -\xi_i^- \le y_i - \beta_0 - \beta^T \Phi(x_i) \le \xi_i^+, \quad \xi_i^+, \xi_i^- \ge 0, \quad i = 1, \ldots, n.$$

Fig. 1 Commonly used loss functions in the SVR
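The four loss functions named in Fig. 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the default values of `eps` and `delta`, and the particular Huber parameterization are assumptions, since the text does not spell out formulas for these losses.

```python
def laplace_loss(r):
    """Laplace (absolute) loss of a residual r = y - f(x)."""
    return abs(r)


def eps_insensitive_loss(r, eps=0.1):
    """Zero inside the band |r| <= eps, linear outside (eps is an assumed default)."""
    return max(abs(r) - eps, 0.0)


def quadratic_loss(r):
    """Squared-error loss."""
    return r * r


def huber_loss(r, delta=1.0):
    """Quadratic near zero, linear in the tails (one common parameterization)."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


if __name__ == "__main__":
    # A large residual is penalized far less by the Laplace loss than by the
    # quadratic loss, which is the robustness argument made in the text.
    for r in (0.05, 0.5, 3.0):
        print(r, laplace_loss(r), eps_insensitive_loss(r),
              quadratic_loss(r), huber_loss(r))
```

The Laplace loss is the only one of the four with no extra parameter besides the residual itself, which matches the reason given above for focusing on it.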

The corresponding Lagrangian function is:

$$L_P = \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) + \frac{\lambda}{2} \|\beta\|_2^2 - \sum_{i=1}^{n} \gamma_i^- \left( y_i - \beta_0 - \beta^T \Phi(x_i) + \xi_i^- \right) + \sum_{i=1}^{n} \gamma_i^+ \left( y_i - \beta_0 - \beta^T \Phi(x_i) - \xi_i^+ \right) - \sum_{i=1}^{n} \rho_i^+ \xi_i^+ - \sum_{i=1}^{n} \rho_i^- \xi_i^-,$$

where $\gamma_i^+$, $\gamma_i^-$, $\rho_i^+$, $\rho_i^-$ are non-negative Lagrangian multipliers. From the Lagrangian function, the original optimization problem can be transformed into its dual problem:

$$\max_{\alpha} \; -\frac{1}{2\lambda} \alpha^T K \alpha + y^T \alpha \tag{3}$$

$$\text{subject to } \sum_{i=1}^{n} \alpha_i = 0, \quad -1 \le \alpha_i \le 1, \quad i = 1, \ldots, n,$$

where $K \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix, with the $(i, i')$ entry $K_{i,i'} = \Phi(x_i)^T \Phi(x_{i'}) = K(x_i, x_{i'})$. Let $\hat{\alpha}$ represent its optimal solution, then the primal variables can be recovered as:

$$\hat{\beta} = \frac{1}{\lambda} \sum_{i=1}^{n} \hat{\alpha}_i \Phi(x_i).$$

Notice that for any input vector $x$, the fitted model is:

$$\hat{f}(x) = \hat{\beta}_0 + \frac{1}{\lambda} \sum_{i=1}^{n} \hat{\alpha}_i K(x_i, x). \tag{4}$$

Therefore, by using the kernel function $K(\cdot, \cdot)$, we can solve the SVR problem without explicitly defining the mapping function $\Phi(\cdot)$. For example, radial basis and polynomial kernels are commonly used kernel functions:

$$\text{Radial:} \quad K(x, x') = \exp(-\|x - x'\|^2 / \sigma^2) \tag{5}$$

$$\text{Polynomial:} \quad K(x, x') = (1 + \langle x, x' \rangle)^d \tag{6}$$

where $\sigma$ and $d$ are user-specified parameters.

2.2 Model for a two-step kernel learning method

The standard SVR uses only a single mapping function $\Phi(\cdot)$ or its corresponding kernel function $K(\cdot, \cdot)$. In our setting, we consider multiple mappings: $\Phi_1(\cdot), \ldots, \Phi_m(\cdot)$, where $m$ is the number of candidate mappings available. Let $K_1(\cdot, \cdot), \ldots, K_m(\cdot, \cdot)$ be the corresponding kernel functions. We look for a sparse linear combination of these kernels that can be used to predict accurately on future data. For example, in the financial market forecasting setting, $K_1(\cdot, \cdot), \ldots, K_m(\cdot, \cdot)$ may correspond to different information sources. Our kernel learning method results in a new kernel function

$$K^*(\cdot, \cdot) = s_1 K_1(\cdot, \cdot) + \cdots + s_m K_m(\cdot, \cdot),$$

where $s_1, \ldots, s_m \ge 0$ are the associated combination coefficients. Our goal is to learn $s_1, \ldots, s_m$, and the method we propose consists of two steps.

In the first step, we solve an SVR problem that uses all candidate mappings (or kernels) simultaneously:

$$\min_{\beta_0, \beta_j, \xi^+, \xi^-} \; \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) + \frac{\lambda}{2} \sum_{j=1}^{m} \|\beta_j\|_2^2 \tag{7}$$

$$\text{subject to } -\xi_i^- \le y_i - \beta_0 - \sum_{j=1}^{m} \beta_j^T \Phi_j(x_i) \le \xi_i^+, \tag{8}$$

$$\xi_i^+, \xi_i^- \ge 0, \quad i = 1, \ldots, n.$$

Similar to the standard SVR, we can transform it into its dual format:

$$\max_{\alpha} \; -\frac{1}{2\lambda} \alpha^T \Big( \sum_{j=1}^{m} K_j \Big) \alpha + y^T \alpha \tag{9}$$

$$\text{subject to } \sum_{i=1}^{n} \alpha_i = 0, \quad -1 \le \alpha_i \le 1, \quad i = 1, \ldots, n,$$

where the $(i, i')$ entry of the kernel matrix $K_j \in \mathbb{R}^{n \times n}$ is defined as $K_j(x_i, x_{i'})$. Notice that this problem is exactly the same as the standard SVR, except that the kernel matrix $K$ in (3) is replaced by the summation of candidate kernel matrices $\sum_{j=1}^{m} K_j$. Notice that the above dual is a quadratic programming (QP) problem, and we can solve it efficiently using the sequential minimal optimization (SMO) method (Platt 1999) or the solution path algorithm (Gunter and Zhu 2006). Let $\hat{\alpha}$ be its optimal solution, then each coefficient vector $\hat{\beta}_j$ can be written as:

$$\hat{\beta}_j = \frac{1}{\lambda} \sum_{i=1}^{n} \hat{\alpha}_i \Phi_j(x_i), \quad j = 1, \ldots, m. \tag{10}$$

In the second step, we introduce scaling parameters $s_j$ ($j = 1, \ldots, m$), one for each of the candidate mappings (or kernels), and consider

$$\beta_j^{\text{new}} = s_j \hat{\beta}_j, \quad j = 1, \ldots, m. \tag{11}$$

To solve for $s_j$, we regularize the $L_1$-norm of $s_j$ (Tibshirani 1996; Bradley and Mangasarian 1998), i.e.:

$$\min_{s_j, \beta_0, \xi^+, \xi^-} \; \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) \tag{12}$$

$$\text{subject to } -\xi_i^- \le y_i - \beta_0 - \sum_{j=1}^{m} s_j \hat{\beta}_j^T \Phi_j(x_i) \le \xi_i^+,$$

$$\sum_{j=1}^{m} s_j \le C, \tag{13}$$

$$s_j \ge 0, \quad j = 1, \ldots, m, \tag{14}$$

$$\xi_i^+, \xi_i^- \ge 0, \quad i = 1, \ldots, n, \tag{15}$$

where $C \ge 0$ is a regularization parameter. Notice that in the second step, the $\hat{\beta}_j$ are known and fixed, and we are interested in solving for $\beta_0$ and $s_j$. Denote the solution as $\hat{\beta}_0$ and $\hat{s}_j$ ($j = 1, \ldots, m$); using (10) and (11), we then have the final fitted model:

$$\hat{f}(x) = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{s}_j \hat{\beta}_j^T \Phi_j(x) = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{s}_j \Big( \frac{1}{\lambda} \sum_{i=1}^{n} \hat{\alpha}_i K_j(x_i, x) \Big).$$

The final fitted model can be further written as:

$$\hat{f}(x) = \hat{\beta}_0 + \frac{1}{\lambda} \sum_{i=1}^{n} \hat{\alpha}_i K^*(x_i, x), \tag{16}$$

with the corresponding learned new kernel:

$$K^*(\cdot, \cdot) = \sum_{j=1}^{m} \hat{s}_j K_j(\cdot, \cdot). \tag{17}$$

In (13), we apply the $L_1$-norm regularization idea (Breiman 1995; Tibshirani 1996; Bradley and Mangasarian 1998), which has been used for variable selection in the setting of linear regression: the singularities of the constraints $\sum_{j=1}^{m} s_j \le C$ and $s_j \ge 0$ tend to generate a sparse solution for $s$, i.e., some of the $s_j$'s are estimated to be exactly zero. The idea is similar to the non-negative garrote method (Breiman 1995), which uses scaling parameters for variable selection in the setting of linear regression. In our method, we use scaling parameters to select kernels. If an $\hat{s}_j$ is equal to zero, the corresponding kernel is removed from the estimated combined kernel (17), hence the method implements automatic kernel selection or kernel learning. The intuition is that if the $j$th mapping (or kernel) is not important for prediction, the corresponding $\hat{\beta}_j^T \Phi_j(x)$ tends to have a small magnitude, hence $s_j$ gets heavily penalized and tends to be estimated as zero, while if the $j$th mapping (or kernel) is important for the prediction, $\hat{\beta}_j^T \Phi_j(x)$ tends to have a large magnitude, hence $s_j$ gets lightly penalized and tends to be estimated as non-zero.

2.3 The solution path algorithm

The regularization parameter $C$ controls the bias-variance trade-off. Increasing the value of $C$ tends to select more kernels and generates a more complicated model, which reduces the bias but increases the variance of the fitted model; and vice versa when decreasing the value of $C$.
Hence the value of $C$ is critical for the performance of the fitted model. In practice, people can pre-specify a number of values for $C$, solve the optimization problem for each of them, and use a validation set or some model selection criterion to choose the best model. Instead of using such a trial-and-error approach, we develop an efficient solution path algorithm that computes the solutions for all possible values of $C$, which facilitates the selection of an optimal value of $C$.

Starting from $C = 0$, the algorithm continuously increases $C$ and calculates $\hat{s}_j$ and $\hat{\beta}_0$ along the path, until the solution does not further change with $C$. The algorithm takes advantage of the piecewise linearity property of $\hat{s}_j$ and $\hat{\beta}_0$ with respect to $C$. We acknowledge that our algorithm is inspired by the LAR/LASSO method (Efron et al. 2004) and the general piecewise linear solution path strategy of Rosset and Zhu. Similar ideas have also been used in the SVM to generate the optimal solution path of the regularization parameter for over-fitting control (Hastie et al. 2004; Gunter and Zhu 2006) or of hyperparameters for the kernel function (Wang et al. 2007). However, we note that the algorithm we develop here is significantly different from the earlier algorithms, because we are now dealing with a non-differentiable loss function and a non-differentiable penalty, while all the earlier algorithms deal with cases where either the loss function is differentiable or the penalty is differentiable. The overall computational cost of our solution path algorithm is approximately $O(\max(n, m) m^2)$. For expositional clarity, we defer the details of the algorithm to the appendix.

3 Financial market forecasting

In this section, we apply our two-step kernel learning method to forecast the financial market and compare its performance to three other methods, specifically, the buy-and-hold strategy, the standard SVR and the kernel learning method proposed by Lanckriet et al. (2004). We chose to compare with Lanckriet et al. (2004) simply because none of the current kernel selection methods have their code publicly available, and the method in Lanckriet et al. (2004) seems to be the most convenient for implementation (and it also works well in practical problems).

The experiments were based on historical prices of the S&P500 and the NASDAQ composite indices. All these methods were tested on 2,500 trading days from 11/1997 to 10/2007, which covers around 10 years. From the daily closing prices of the S&P500 and the NASDAQ indices, we first calculated the 5 (weekly), 10 (biweekly), 20 (monthly), and 50-day (quarterly) moving averages, which are technical indicators widely used by traders. To illustrate our preprocessing procedure, let $P_t$ denote the closing price on day $t$ and $A_{t,T}$ be the $T$-day moving average on day $t$, where $A_{t,T}$ is calculated as:

$$A_{t,T} = \frac{1}{T} \sum_{k=t-T+1}^{t} P_k, \quad \text{for } T = 5, 10, 20, 50.$$

Notice that $P_t$ can also be considered as $A_{t,1}$. Then we calculated the moving average log-return $R_{t,T}$ (including the daily log-return) for each day $t$:

$$R_{t,T} = \ln(A_{t,T}) - \ln(A_{t-T,T}), \quad \text{for } T = 1, 5, 10, 20, 50.$$

We chose to use the log-return, which is a more widely used measure in finance than the raw return, because the return tends to have a log-normal distribution, while the log-return has an approximate normal distribution. Since the log-return was used in our experiments, the cumulative log-return over a certain period was simply the summation of the daily log-returns over this period. Overall we had 5 time series: (1) $R_{t,1}$, (2) $R_{t,5}$, (3) $R_{t,10}$, (4) $R_{t,20}$, and (5) $R_{t,50}$. The output variable was $R_{t+1,1}$, the next day log-return.
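The preprocessing formulas above translate directly into code. The following is a minimal sketch, assuming the closing prices are held in a plain Python list indexed from day 0; the function names and the toy price series are hypothetical and not taken from the paper.

```python
import math


def moving_average(prices, t, T):
    """A_{t,T}: mean of the T closing prices ending on day t (0-based index)."""
    if t - T + 1 < 0:
        raise ValueError("not enough history for a T-day average on day t")
    return sum(prices[t - T + 1:t + 1]) / T


def ma_log_return(prices, t, T):
    """R_{t,T} = ln(A_{t,T}) - ln(A_{t-T,T}); T = 1 gives the daily log-return."""
    return (math.log(moving_average(prices, t, T))
            - math.log(moving_average(prices, t - T, T)))


if __name__ == "__main__":
    prices = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]  # toy data
    # Two-day moving-average log-return on day 3.
    print(ma_log_return(prices, 3, 2))
    # Daily log-returns add up to the log-return over the whole period.
    daily = [ma_log_return(prices, t, 1) for t in range(1, len(prices))]
    print(sum(daily), math.log(prices[-1] / prices[0]))
```

The last two printed numbers agree up to floating-point error, which illustrates why the cumulative log-return is simply the sum of the daily log-returns.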

Then we constructed the input features. For the $j$th time series, we extracted $p_j$ data points to construct the input features. Specifically, let

$$x_t^j = [R_{t-(p_j-1)T_j,\,T_j},\; R_{t-(p_j-2)T_j,\,T_j},\; \ldots,\; R_{t,\,T_j}], \quad \text{for } j = 1, \ldots, 5,$$

where $T_1 = 1$, $T_2 = 5$, $T_3 = 10$, $T_4 = 20$ and $T_5 = 50$. Notice that as $T_j$ increases, the corresponding time series $R_{t,T_j}$ becomes smoother, hence fewer data points were needed for $x_t^j$. Specifically, we let $p_1 = 10$, $p_2 = 8$, $p_3 = 6$, $p_4 = 4$ and $p_5 = 4$. The overall input features for day $t$ were then:

$$x_t = [x_t^1, x_t^2, x_t^3, x_t^4, x_t^5].$$

The intuition is that different $x_t^j$ represent different characteristics underlying the financial market. For example, $x_t^1$ and $x_t^2$ capture the short-term (daily and weekly) behaviors of the market, while $x_t^4$ and $x_t^5$ capture the long-term (monthly and quarterly) trends. It is not clear a priori which features are important for predicting the next day return, nor how they should be combined to predict. We treated these features separately by applying separate kernels to them and used our method to automatically select and combine them to help us decide whether we should long or short for the next day.

Since the radial basis kernel is the most popular kernel used in practical problems and it has also been shown to be useful in financial market forecasting, for example, see Cao and Tay (2003), Kim (2003) and Huang et al. (2005), we chose to use the radial basis kernel (5). The radial basis kernel contains a scaling parameter $\sigma^2$, which controls the effective support of the kernel function. To illustrate our method, we considered two possible values for $\sigma^2$. Specifically, for each feature $x_t^j$ (which is a $p_j$-vector, $j = 1, \ldots, 5$), we considered two candidate radial basis kernel functions:

$$K_{j,1}(x_t^j, x_{t'}^j) = \exp(-\|x_t^j - x_{t'}^j\|^2 / m_j^2) \quad \text{and} \quad K_{j,2}(x_t^j, x_{t'}^j) = \exp(-\|x_t^j - x_{t'}^j\|^2 / (10 m_j^2)),$$

where $m_j^2$ is the median value of $\|x_t^j - \bar{x}^j\|^2$. Therefore, there were a total of $5 \times 2 = 10$ candidate kernel functions.
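As a rough illustration of how the two candidate kernels per feature block and the combined kernel (17) could be formed, here is a standard-library sketch. It assumes that $m_j^2$ is the median squared distance from each training vector to the mean vector of that block; all function names and the toy data are hypothetical.

```python
import math
from statistics import median


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def median_sq_dev(X):
    """Assumed m^2: median squared distance from each row of X to the mean row."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    return median(sq_dist(x, mean) for x in X)


def rbf_kernel_matrix(X, sigma2):
    """Radial basis kernel matrix with entries exp(-||u - v||^2 / sigma2)."""
    return [[math.exp(-sq_dist(u, v) / sigma2) for v in X] for u in X]


def candidate_kernels(X):
    """Two candidate kernels for one feature block: sigma^2 = m^2 and 10 m^2."""
    m2 = median_sq_dev(X)
    return [rbf_kernel_matrix(X, m2), rbf_kernel_matrix(X, 10.0 * m2)]


def combine_kernels(kernels, s):
    """Weighted sum of kernel matrices, as in the learned kernel (17)."""
    n = len(kernels[0])
    return [[sum(sj * K[a][b] for sj, K in zip(s, kernels)) for b in range(n)]
            for a in range(n)]


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # toy feature block
    K1, K2 = candidate_kernels(X)
    K_star = combine_kernels([K1, K2], [0.7, 0.3])
    print(K_star[0][1])
```

Setting a weight in `s` to zero drops the corresponding kernel from the combination, which is how a sparse solution of the second step performs kernel selection.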
In evaluating the fitted model, we used a simple strategy to invest: let $\hat{f}_t$ be the forecasted next day log-return on day $t$; if $\hat{f}_t \ge 0$, we buy at the closing price on day $t$; otherwise, we short it. Then the log-return, $\tilde{R}_t$, associated with day $t$ based on our strategy, can be computed as:

$$\tilde{R}_t = \begin{cases} |R_{t+1,1}|, & \text{if } R_{t+1,1} \hat{f}_t \ge 0; \\ -|R_{t+1,1}|, & \text{otherwise.} \end{cases}$$

Figure 2 illustrates how we selected the tuning parameters $\lambda$ in (9) and $C$ in (12) for our two-step procedure and how we evaluated the fitted model $\hat{f}$ from our two-step procedure. Starting from 11/1997, we first trained models (via the solution path algorithm) on 150 consecutive data points (the training set); second, we applied the obtained models to the next 10 data points (the validation set) and selected the values of $\lambda$ and $C$ that had the best performance; third, we combined 140 data points of the training set and all 10 data points of the validation set into a new set, which we call the true training set, and trained the final model using the selected values of $\lambda$ and $C$; last, we applied the final model to another 10 data points (the test set) and recorded its performance. We then shifted forward by 10 data points (or 10 trading days) and repeated the same procedure. We repeated this 250 times, which corresponds to 2,500 trading days until 10/2007. Notice that there was no overlap between any two test sets.
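The trading rule and the rolling train/validation/test scheme described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, indices are 0-based positions in the sample, and the choice of the last 140 training points for the "true training set" is an assumption, since the text does not say which 140 points are kept.

```python
def strategy_log_return(forecast, realized):
    """Strategy log-return for one day: +|realized| if the forecast and the
    realized next-day log-return agree in sign (product >= 0), else -|realized|."""
    return abs(realized) if realized * forecast >= 0 else -abs(realized)


def rolling_windows(n_rounds=250, train=150, val=10, test=10, step=10):
    """Yield (train, validation, true-training, test) index ranges per round."""
    for k in range(n_rounds):
        start = k * step
        train_idx = range(start, start + train)
        val_idx = range(start + train, start + train + val)
        # Assumption: keep the most recent 140 training points plus validation.
        final_train_idx = range(start + val, start + train + val)
        test_idx = range(start + train + val, start + train + val + test)
        yield train_idx, val_idx, final_train_idx, test_idx


if __name__ == "__main__":
    print(strategy_log_return(0.01, -0.02))   # wrong direction: lose 0.02
    print(strategy_log_return(-0.01, -0.02))  # correct short: gain 0.02
    windows = list(rolling_windows())
    test_days = [i for _, _, _, te in windows for i in te]
    print(len(windows), len(test_days), len(set(test_days)))
```

Because each round shifts by exactly the test-set length, the test ranges tile 2,500 consecutive positions without overlap, in line with the remark that no two test sets overlap.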

Fig. 2 Training, validation and test sets

Table 1 Performance on the S&P500 and NASDAQ indices. Average biweekly return is the average of log-returns over 250 different test sets, and the numbers in the parentheses are the corresponding standard errors. p-values were computed from the paired t-test against the buy-and-hold strategy

Method                 Average biweekly log-return (%)   p-value
S&P500:
  Buy-and-hold         [missing] (0.217)                 [missing]
  SVR                  [missing] (0.216)                 [missing]
  Kernel learning      [missing] (0.231)                 [missing]
  Two-step procedure   [missing] (0.231)                 [missing]
NASDAQ:
  Buy-and-hold         [missing] (0.351)                 [missing]
  SVR                  [missing] (0.374)                 [missing]
  Kernel learning      [missing] (0.339)                 [missing]
  Two-step procedure   [missing] (0.361)                 [missing]

For the standard SVR, we used $x_t$ as the input feature, which contains information from different time series without differentiating them, and considered one single radial basis kernel $K(x_t, x_{t'}) = \exp(-\|x_t - x_{t'}\|^2 / \sigma^2)$. There were two tuning parameters, the $\lambda$ as in (3) and the $\sigma$. For $\lambda$, we used the solution path algorithm that was developed in Gunter and Zhu (2006), which computes solutions for all possible values of $\lambda$, and the optimal $\lambda$ was selected using a validation set. For $\sigma$, we first computed the median of $\|x_t - \bar{x}\|$, where $x_t$ is the input feature vector of the $t$th training observation and $\bar{x}$ is the corresponding mean. Denote it as $m_x$. Then we searched for the optimal $\sigma$ over $(0.2 m_x, 0.4 m_x, \ldots, 2 m_x)$. For the kernel learning method by Lanckriet et al. (2004), we selected the tuning parameters in a similar fashion. All selected final models were evaluated on the same test sets.

Since there are 5 trading days per week, the summation of log-returns on each test set represents the biweekly log-return. Table 1 shows the average biweekly log-returns among 250 test sets on the two markets from four different methods. The average biweekly log-returns of our method are 4.3 and 5.6 times those of the buy-and-hold strategy on the S&P500 and the NASDAQ, respectively. We also used the paired t-test to compare the log-returns of our method against the buy-and-hold strategy.
The p-value for the S&P500 is 0.03, and the p-value for the NASDAQ is [value missing]. The standard SVR method and the kernel learning method by Lanckriet et al. (2004) also performed better than the buy-and-hold strategy, but the improvements are not statistically significant based on the paired t-test. Figures 3 and 4 show the cumulative log-returns for all these methods on the two markets. We can see that the log-returns of our method have a consistently increasing trend over the 2,500 trading day period.
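For readers who want to reproduce this kind of comparison, a paired t-statistic on two series of per-test-set log-returns can be computed as sketched below. The function names and the toy numbers are hypothetical; the p-value helper uses a normal approximation and a two-sided alternative, both of which are assumptions (the text does not state the sidedness of the test, and an exact p-value would use the t distribution with n - 1 degrees of freedom).

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for the paired differences a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    se = stdev(d) / math.sqrt(n)  # standard error of the mean difference
    return mean(d) / se, n - 1


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (reasonable for large n)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    method = [1.0, 2.0, 4.0]       # toy biweekly log-returns of a method
    buy_and_hold = [0.0, 1.0, 1.0]  # toy biweekly log-returns of buy-and-hold
    t, df = paired_t_statistic(method, buy_and_hold)
    print(t, df, approx_two_sided_p(t))
```

With 250 paired test sets, as in the experiments above, the normal approximation is close to the exact t-based value; with only a handful of pairs, as in the toy example, it is not.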

Fig. 3 Cumulative log-returns of the four methods on the S&P500 index

Fig. 4 Cumulative log-returns of the four methods on the NASDAQ index

Table 2 Kernel selection on the S&P500 index. Each row corresponds to a candidate kernel; the subscript 1 corresponds to $\sigma^2 = m_j^2$ and the subscript 2 corresponds to $\sigma^2 = 10 m_j^2$; the superscript indicates the part of the input features that the kernel is based on. The second column contains the average estimated combining coefficient $\hat{s}_j$ over the 250 different trials, and the numbers in the parentheses are the corresponding standard errors. The third column records the selection frequency for each kernel out of 250 trials

          Two-step procedure                  Kernel learning
Kernel    s_j                   Frequency     s_j                   Frequency
K^x       [missing] (0.2373)    250/250       [missing] (0.0269)    250/250
K^x       [missing] (0.4015)    154/250       [missing] (0.0013)    2/250
K^x       [missing] (0.4361)    218/250       [missing] (0.0267)    82/250
K^x       [missing] (0.3884)    86/250        [missing] (0.0000)    0/250
K^x       [missing] (0.5506)    157/250       [missing] (0.0053)    24/250
K^x       [missing] (0.3474)    49/250        [missing] (0.0000)    0/250
K^x       [missing] (0.5591)    97/250        [missing] (0.0005)    5/250
K^x       [missing] (0.3722)    47/250        [missing] (0.0000)    0/250
K^x       [missing] (0.4950)    74/250        [missing] (0.0018)    4/250
K^x       [missing] (0.3722)    51/250        [missing] (0.0000)    0/250

To further investigate the effects of different characteristics of the financial time series, we also recorded how each kernel was selected. Table 2 summarizes the results for our method and the kernel learning method in Lanckriet et al. (2004) on the S&P500 index. As we can see, the two methods behaved very differently on this particular dataset. For our two-step kernel learning method, overall, the short-term (daily and weekly) trends had a bigger impact than the long-term (monthly and quarterly) trends in predicting the next day return; however, the long-term trends also played a significant role. On the other hand, the kernel learning method by Lanckriet et al. (2004) tended to generate a sparser model: most of the time, it only selected the kernel that contained the daily information. Regarding the scaling parameter of the radial basis kernel, the value $m_j^2$ seemed to be preferred over $10 m_j^2$. The results on the NASDAQ index are similar.

4 Discussion

In this paper, we have proposed a kernel learning method based on the support vector regression for financial market forecasting.
Our method consists of two steps, an idea similar to the one used by the non-negative garrote (Breiman 1995) for variable selection in the setting of linear regression. In the first step, we fit a standard SVR using all candidate kernels; in the second step, we use scaling parameters, one for each candidate kernel, and search for a sparse linear combination of the kernels via the $L_1$-norm regularization. For the second step, we have also developed an efficient solution path algorithm that solves the optimal solutions for all possible values of the regularization parameter.

Our two-step kernel learning method shows promising results in forecasting the financial market. The trading strategy based on our method consistently outperforms the market, and the excess returns are statistically significant. Readers might be curious why we are publishing this work instead of making money for ourselves from the financial market. The main reason is that in the current experiment, transaction costs have not been considered. Suppose that the transaction cost takes off 0.1% of the capital for each trade; then, with one trade per day, it will eliminate 250% of the return over the 2,500-day trading period, which makes our method unprofitable. One may also argue that the transaction cost might become negligible if our trading volume is large enough. However, large orders can also have an impact on the market price, and we do not have related data to discover the range of order sizes within which we do not change the market behavior at its closing time. Although our method and the current trading strategy cannot bring economic benefits, they did demonstrate that the financial market is not completely efficient and is predictable to some degree, which is aligned with the conclusions of many previous works.

Appendix

In the appendix, we describe the details of the solution path algorithm that computes the solutions of (12)-(15) for all possible values of $C$.

Problem setup

To simplify the notation, we define:

$$f_j(x) = \hat{\beta}_j^T \Phi_j(x) = \frac{1}{\lambda} \sum_{i=1}^{n} \hat{\alpha}_i K_j(x_i, x). \tag{18}$$

Then the Lagrangian function of (12)-(15) is:

$$L_P = \sum_{i=1}^{n} (\xi_i^+ + \xi_i^-) + \eta \Big( \sum_{j=1}^{m} s_j - C \Big) - \sum_{j=1}^{m} \varepsilon_j s_j + \sum_{i=1}^{n} \gamma_i^+ \Big( y_i - \beta_0 - \sum_{j=1}^{m} s_j f_j(x_i) - \xi_i^+ \Big) - \sum_{i=1}^{n} \gamma_i^- \Big( y_i - \beta_0 - \sum_{j=1}^{m} s_j f_j(x_i) + \xi_i^- \Big) - \sum_{i=1}^{n} \rho_i^+ \xi_i^+ - \sum_{i=1}^{n} \rho_i^- \xi_i^-,$$

where $\gamma_i^+, \gamma_i^-, \eta, \varepsilon_j, \rho_i^+, \rho_i^- \ge 0$ are Lagrangian multipliers. Let $\gamma_i = \gamma_i^+ - \gamma_i^-$ ($i = 1, \ldots, n$), then the Karush-Kuhn-Tucker (KKT) conditions are:

$$\frac{\partial}{\partial \beta_0}: \quad \sum_{i=1}^{n} \gamma_i = 0, \tag{19}$$

$$\frac{\partial}{\partial s_j}: \quad -\sum_{i=1}^{n} \gamma_i f_j(x_i) + \eta - \varepsilon_j = 0, \tag{20}$$

$$\frac{\partial}{\partial \xi_i^+}: \quad 1 - \gamma_i^+ - \rho_i^+ = 0, \tag{21}$$

$$\frac{\partial}{\partial \xi_i^-}: \quad 1 - \gamma_i^- - \rho_i^- = 0. \tag{22}$$

Let $f(x_i) = \beta_0 + \sum_{j=1}^{m} s_j f_j(x_i)$, then the complementary slackness conditions are:

$$\gamma_i^+ (y_i - f(x_i) - \xi_i^+) = 0, \tag{23}$$

$$\gamma_i^- (y_i - f(x_i) + \xi_i^-) = 0, \tag{24}$$

$$\eta \Big( \sum_{j=1}^{m} s_j - C \Big) = 0, \tag{25}$$

$$\rho_i^+ \xi_i^+ = 0, \tag{26}$$

$$\rho_i^- \xi_i^- = 0, \tag{27}$$

$$\varepsilon_j s_j = 0. \tag{28}$$

Note that all the Lagrangian multipliers are non-negative. Let $\xi_i = \xi_i^+ - \xi_i^-$ and $\rho_i = \rho_i^+ - \rho_i^-$; conditions (21) to (28) lead to the following relationships:

$$y_i - f(x_i) > 0: \quad \gamma_i = 1, \; \xi_i > 0;$$
$$y_i - f(x_i) < 0: \quad \gamma_i = -1, \; \xi_i < 0;$$
$$y_i - f(x_i) = 0: \quad -1 \le \gamma_i \le 1, \; \xi_i = 0.$$

Using these relationships, we can define the following sets:

$R = \{i : y_i - f(x_i) > 0, \; \gamma_i = 1\}$ (the Right portion of the Laplace loss function)
$E = \{i : y_i - f(x_i) = 0, \; -1 \le \gamma_i \le 1\}$ (the Elbow of the Laplace loss function)
$L = \{i : y_i - f(x_i) < 0, \; \gamma_i = -1\}$ (the Left portion of the Laplace loss function)
$A = \{j : s_j \ne 0\}$ (the Active set of the kernels)

Note that if $i \in L$ or $i \in R$, then $\gamma_i$ is known; only when $i \in E$ does $\gamma_i$ become unknown.

From the KKT conditions and the derived relationships, we can get two linear systems for the primal and dual variables. We call the following equations the primal system:

$$y_i - \beta_0 - \sum_{j \in A} s_j f_j(x_i) = 0, \quad i \in E, \tag{29}$$

$$\sum_{j \in A} s_j = C. \tag{30}$$

According to (28), if $j \in A$ ($s_j \ne 0$), then $\varepsilon_j = 0$. So we also have the following dual system:

$$\sum_{i \in E} \gamma_i + |R| - |L| = 0, \tag{31}$$

$$\sum_{i=1}^{n} \gamma_i f_j(x_i) - \eta = 0, \quad j \in A. \tag{32}$$

For any optimal solution at any $C$, the primal and the dual systems must hold; and vice versa, if we solve these systems, we obtain the optimal solution. Note that in the primal system there are $|A| + 1$ unknowns and $|E| + 1$ equations; in the dual system, there are $|E| + 1$ unknowns and $|A| + 1$ equations. Therefore, to satisfy both systems, we must have $|A| = |E|$.

Initial conditions

Initially, the algorithm starts from $C = 0$, where $s_j = 0$ and $f(x_i) = \beta_0$. Therefore, (12)-(15) is reduced to an optimization problem with only one parameter $\beta_0$. Since $y_1, \ldots, y_n$ are real numbers, we assume that their values are distinct. Let $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ be the output values in ascending order, then the solution for the initial $\beta_0$ is:

$$\beta_0 = y_{(\lceil n/2 \rceil)},$$

where $\lceil \cdot \rceil$ is the ceiling function. According to the definitions of $L$, $R$ and $E$, we have $|L| = \lceil n/2 \rceil - 1$, $|R| = n - \lceil n/2 \rceil$ and $|E| = 1$. Based on the derived relationships, if $y_i > \beta_0$ then $\gamma_i = 1$; if $y_i < \beta_0$ then $\gamma_i = -1$. Let $i^*$ denote the point in $E$, i.e., $E = \{i^*\}$. From (19), we get: if $n$ is odd, then $\gamma_{i^*} = 0$; otherwise, $\gamma_{i^*} = -1$.

When $C$ increases from 0, some $s_j$ will become non-zero and join $A$. If $C$ is increased by an infinitesimal amount, $C = \Delta C \to 0^+$, exactly one $s_j$ will be added into $A$, because we currently have $|E| = 1$, and $|A| = |E|$ has to hold. To determine which $s_j$ will join $A$, we consider the following problem:

$$\min_{\beta_0, \, j \in \{1, \ldots, m\}} \; \sum_{i=1}^{n} |y_i - \beta_0 - \Delta C \, f_j(x_i)|. \tag{33}$$

According to the derived relationships in Problem setup, problem (33) is equivalent to:

$$\min_{\beta_0, \, j \in \{1, \ldots, m\}} \; \sum_{i=1}^{n} \gamma_i (y_i - \beta_0 - \Delta C \, f_j(x_i)). \tag{34}$$

Let $A = \{j^*\}$, where $j^*$ is determined by:

$$j^* = \arg\max_{j \in \{1, \ldots, m\}} \sum_{i=1}^{n} \gamma_i f_j(x_i).$$

Now, we have the initial $\beta_0$, $A$, $L$, $R$ and $E$, and we also need the initial value for the non-negative dual variable $\eta$ as $C \to 0^+$. According to (32), it is determined by:

$$\eta = \sum_{i=1}^{n} \gamma_i f_{j^*}(x_i).$$

In the following, we use the superscript $l$ to indicate the iteration number. After the initial setup, $l = 0$ and $L^l$, $R^l$, $E^l$, $A^l$, $s_j^l$, $\gamma_i^l$, $\beta_0^l$ and $\eta^l$ are all available.

Solution path

When $C$ increases by an infinitesimal amount, the sets $L$, $R$, $E$ and $A$ do not change due to their continuity with respect to $C$. Therefore the structure of the primal system (29)-(30) does not change. The right derivatives of the primal variables with respect to $C$ can be obtained by:

$$\frac{\partial \beta_0}{\partial C} + \sum_{j \in A} \frac{\partial s_j}{\partial C} f_j(x_i) = 0, \quad i \in E, \tag{35}$$

$$\sum_{j \in A} \frac{\partial s_j}{\partial C} = 1. \tag{36}$$

Since $|E| = |A|$, $\frac{\partial \beta_0}{\partial C}$ and $\frac{\partial s_j}{\partial C}$ are constant and can be uniquely determined, assuming the linear system is non-singular. Therefore, if $C$ increases by a small enough amount, the primal variables and fitted values are linear functions of $C$:

$$\beta_0 = \beta_0^l + \frac{\partial \beta_0}{\partial C} (C - C^l),$$

$$s_j = s_j^l + \frac{\partial s_j}{\partial C} (C - C^l),$$

$$f(x_i) = f^l(x_i) + \frac{\partial \beta_0}{\partial C} (C - C^l) + \sum_{j \in A} \frac{\partial s_j}{\partial C} f_j(x_i) (C - C^l).$$

If the increase in $C$ is large enough, some of the sets $L$, $R$, $E$ and $A$ may change. Specifically, one of the following two events may occur in the primal system:

a point leaves $L$ or $R$ and joins $E$, i.e., its residual $y_i - f(x_i)$ changes from non-zero to zero; or
an active $s_j$ becomes inactive, i.e., $s_j$ reduces from a positive value to zero.

To determine $\Delta C$ for the first event, we calculate for every $i \notin E$:

$$\Delta \hat{C}_i = \max \left\{ \frac{y_i - f^l(x_i)}{\frac{\partial \beta_0}{\partial C} + \sum_{j \in A} \frac{\partial s_j}{\partial C} f_j(x_i)}, \; 0 \right\}.$$

For the second event, we calculate

$$\Delta C_j = \max \left\{ -s_j^l \Big/ \frac{\partial s_j}{\partial C}, \; 0 \right\}, \quad \text{for } j \in A.$$

Therefore, the overall $\Delta C$ is

$$\Delta C = \min \{ \Delta \hat{C}_i \; (i \notin E, \, \Delta \hat{C}_i > 0), \; \Delta C_j \; (j \in A, \, \Delta C_j > 0) \},$$

and we get the updated variables $s_j^{l+1}$, $\beta_0^{l+1}$ and $C^{l+1}$. If the first event occurs, $|E|$ will increase by 1; if the second event occurs, $|A|$ will decrease by 1. Therefore, it always leads to $|E| = |A| + 1$ after an event occurs. Since the relation $|E| = |A|$ does not hold, we have to restore it by taking one of the two actions:

removing a point from $E$; or
adding a new $s_j$ into $A$.

Solving the dual system helps us determine which action to take. Since we now have $|E| = |A| + 1$, in the dual system (31)-(32), there is one more unknown than the number of equations. In other words, there is one degree of freedom in the system, and we can express $\gamma_i$ ($i \in E$) as a linear function of $\eta$. The derivatives of $\gamma_i$ with respect to $\eta$ can be solved by the following equations:

$$\sum_{i \in E} \frac{\partial \gamma_i}{\partial \eta} = 0, \tag{37}$$

$$\sum_{i \in E} \frac{\partial \gamma_i}{\partial \eta} f_j(x_i) = 1, \quad j \in A. \tag{38}$$

In (37)-(38), there are $|E|$ unknowns and $|A| + 1$ equations, so the values of $\frac{\partial \gamma_i}{\partial \eta}$ ($i \in E$) can be uniquely determined. To determine the $\Delta \eta$ for taking the first action (removing a point from $E$), we calculate for every $i \in E$:

$$\Delta \eta_i = \begin{cases} (-1 - \gamma_i^l) \Big/ \frac{\partial \gamma_i}{\partial \eta}, & \text{if } \frac{\partial \gamma_i}{\partial \eta} > 0, \\ (1 - \gamma_i^l) \Big/ \frac{\partial \gamma_i}{\partial \eta}, & \text{if } \frac{\partial \gamma_i}{\partial \eta} < 0, \\ -\infty, & \text{otherwise.} \end{cases}$$

On the other hand, since (32) must hold for $j \in A$, we can determine the $\Delta \eta$ for taking the second action (adding a new $s_j$ into $A$) by calculating:

$$\Delta \eta_j = \begin{cases} \dfrac{\eta^l - \sum_{i=1}^{n} \gamma_i^l f_j(x_i)}{\sum_{i \in E} \frac{\partial \gamma_i}{\partial \eta} f_j(x_i) - 1}, & \text{if } \sum_{i \in E} \frac{\partial \gamma_i}{\partial \eta} f_j(x_i) - 1 < 0, \\ -\infty, & \text{otherwise,} \end{cases}$$

for every $j \notin A$. When $\eta$ reduces to $\eta^l + \Delta \eta_j$, $s_j$ will join $A$. Since $\eta$ can only decrease, the overall $\Delta \eta$ is

$$\Delta \eta = \max \{ \Delta \eta_i \; (i \in E), \; \Delta \eta_j \; (j \notin A) \}, \tag{39}$$

and we get the updated dual variables $\gamma_i^{l+1}$, $\eta^{l+1}$ and the sets $L^{l+1}$, $R^{l+1}$, $E^{l+1}$ and $A^{l+1}$. Once an action is taken, we restore $|E| = |A|$.

The algorithm keeps increasing $C$ and alternates between reaching the next event and taking an action, until $\eta$ reduces to 0. Between any two consecutive events the optimal solutions are linear in $C$; after an event occurs, the derivatives with respect to $C$ are changed. Therefore, the solution path is piecewise linear in $C$, and each event corresponds to a kink on the path.

Figure 5 shows the solution path for a randomly generated data set with 50 samples and 10 candidate radial basis kernels. For this particular data set, it takes 48 iterations to compute the entire solution path. We can see that the 10 candidate kernels are selected one by one as the algorithm proceeds (or the value of $C$ increases). The algorithm ends up with $\hat{s}_j = 1$ for all $j$, which corresponds to the scenario where the constraint $\sum_{j=1}^{m} s_j \le C$ is fully released. For clarity, we summarize our path algorithm in Table 3.

Computational cost

Our kernel learning method consists of two steps: solving a QP problem and running our solution path algorithm.
The first problem has been extensively addressed in the literature, e.g., the SMO algorithm (Platt 1999; Fan et al. 2005) and the solution path algorithm (Gunter and Zhu 2006). In the second step, our solution path algorithm solves a series of LP problems. The major computation at each iteration is to solve the primal system (35)–(36) and the dual system (37)–(38). Since $|\mathcal{A}|$ is upper bounded by $m$, the computational complexity seems to be $O(m^3)$ for each iteration. However, between any two consecutive iterations, only one element in the sets $\mathcal{L}$, $\mathcal{E}$, $\mathcal{R}$ and $\mathcal{A}$ tends to get changed, so using inverse updating and downdating the computational complexity can be reduced to $O(m^2)$. It is difficult to predict the total number of iterations for completing the algorithm, but our experience suggests that $O(\max(n, m))$ is a reasonable estimate. The heuristic is that the algorithm needs $m$ iterations to include all the $s_j$'s into $\mathcal{A}$ and $n$ iterations to move all the points to $\mathcal{E}$. Therefore, the overall computational cost of our solution path algorithm is approximately $O(\max(n, m)\, m^2)$.

Fig. 5 Piecewise linear solution path for a randomly generated data set

Table 3 Outline of the algorithm
Initialization: Calculate $\beta_0^0$ and $\eta^0$; determine $\mathcal{A}^0$, $\mathcal{L}^0$, $\mathcal{R}^0$, $\mathcal{E}^0$ and set $l = 0$;
Step 1: Solve the primal system (35)–(36);
Step 2: Calculate $\Delta C$; identify the next event for the primal system and update the primal variables;
Step 3: Solve the dual system (37)–(38);
Step 4: Calculate $\Delta \eta$; identify the next event for the dual system and update the dual variables;
Step 5: If the stopping criterion is met, then stop the algorithm;
Step 6: Otherwise, let $l = l + 1$ and go to Step 1.

References

Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st international conference on machine learning (p. 6).
Bradley, P., & Mangasarian, O. (1998). Feature selection via concave minimization and support vector machines. In Machine learning proceedings of the fifteenth international conference.
Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics, 37.
Cao, L., & Tay, F. (2001). Financial forecasting using support vector machines. Neural Computing and Applications, 10.

Cao, L., & Tay, F. (2003). Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks, 14.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46.
Cristianini, N., Kandola, J., Elisseeff, A., & Taylor, J. (2006). On kernel target alignment. Innovations in machine learning: theory and applications. Berlin: Springer.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32.
Fama, E. (1970). Efficient capital markets: a review of theory and empirical work. Journal of Finance, 25.
Fama, E. (1991). Efficient capital markets: II. Journal of Finance, 46.
Fan, R., Chen, P., & Lin, C. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6.
Friedrichs, F., & Igel, C. (2005). Evolutionary tuning of multiple SVM parameters. Neurocomputing, 64.
Fung, G., Dundar, M., Bi, J., & Rao, B. (2004). A fast iterative algorithm for Fisher discriminant using heterogeneous kernels. In Proceedings of the 21st international conference on machine learning (p. 40).
Gestel, T., Suykens, J., Baestaens, D., Lambrechts, A., Lanckriet, G., Vandaele, B., Moor, B., & Vandewalle, J. (2001). Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 12.
Gunter, L., & Zhu, J. (2006). Computing the solution path for the regularized support vector regression. In Advances in neural information processing systems.
Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5.
Hirshleifer, D. (2001). Investor psychology and asset pricing. The Journal of Finance, 4.
Huang, W., Nakamori, Y., & Wang, S. (2005). Forecasting stock market movement direction with support vector machine. Computers and Operations Research, 32.
Keerthi, S., Sindhwani, V., & Chapelle, O. (2007). An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in neural information processing systems.
Kim, K. (2003). Financial time series forecasting using support vector machines. Neurocomputing, 55.
Kim, S., Magnani, A., & Boyd, S. (2006). Optimal kernel selection in kernel Fisher discriminant analysis. In Proceedings of the 23rd international conference on machine learning.
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5.
Lo, A., & MacKinlay, C. (2001). A non-random walk down Wall Street. Princeton: Princeton University Press.
Lo, A., Mamaysky, H., & Wang, J. (2000). Foundations of technical analysis: computational algorithms, statistical inference, and empirical implementation. Journal of Finance, 55.
Malkiel, J. (1973). A random walk down Wall Street. New York: W. W. Norton & Company.
Nguyen, H., Ohn, S., & Choi, W. (2007). Combined kernel function for support vector machine and learning method based on evolutionary algorithm. In Proceedings of the 11th international conference on neural information processing.
Ong, C., Smola, A., & Williamson, R. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in kernel methods: support vector learning. Cambridge: MIT Press.
Rosset, S., & Zhu, J. (2007). Piecewise linear regularized solution paths. The Annals of Statistics, 35.
Smola, A., & Schoelkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B, 58.
Vapnik, V. (1995). The nature of statistical learning theory. Berlin: Springer.
Wang, G., Yeung, D., & Lochovsky, F. (2007). A kernel path algorithm for support vector machines. In Proceedings of the 24th international conference on machine learning.
Wang, X., Chen, S., Lowe, D., & Harris, C. (2006). Sparse support vector regression based on orthogonal forward selection for the generalised kernel model. Neurocomputing, 70.