Prediction of Stock Market Index Movement by Ten Data Mining Techniques




Prediction of Stock Market Index Movement by Ten Data Mining Techniques

Phichhang Ou (Corresponding author)
School of Business, University of Shanghai for Science and Technology
Rm 0, International Exchange Center, No. 516, Jun Gong Road, Shanghai 200093, China
Tel: 86-36-67-55-547, Fax: +86--55750, E-mail: phichhang@gmail.com

Hengshan Wang
School of Business, University of Shanghai for Science and Technology
Box 46, No. 516, Jun Gong Road, Shanghai 200093, China
Tel: 86--557-597, E-mail: wanghs@usst.edu.cn

This work is supported by the Shanghai Leading Academic Discipline Project, Project Number: S30504.

Abstract

The ability to predict the direction of a stock or index price accurately is crucial for market dealers and investors seeking to maximize their profits. Data mining techniques have been shown to generate high forecasting accuracy for stock price movement. Nowadays, instead of relying on a single method, traders need various forecasting techniques to gain multiple signals and more information about the future of the markets. In this paper, ten different data mining techniques are discussed and applied to predict the price movement of the Hang Seng index of the Hong Kong stock market. The approaches include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), K-nearest neighbor classification, Naïve Bayes based on kernel estimation, the Logit model, tree based classification, neural network, Bayesian classification with Gaussian process, support vector machine (SVM) and least squares support vector machine (LS-SVM). Experimental results show that the SVM and LS-SVM generate superior predictive performance relative to the other models. Specifically, SVM is better than LS-SVM for in-sample prediction, but LS-SVM is, in turn, better than the SVM for out-of-sample forecasts in terms of the hit rate and error rate criteria.

Keywords: Data mining, Stock price movement prediction, SVM, LS-SVM, Bayesian classification with Gaussian processes

1. Introduction

The financial market is a complex, nonstationary, noisy, chaotic, nonlinear and dynamic system, but it does not follow a random walk process (Lo & MacKinlay, 1988; Deng, 2006). Many factors may cause fluctuations in financial market movement; the main ones include economic conditions, the political situation, traders' expectations, catastrophes and other unexpected events. Therefore, predicting the stock market price and its direction is quite difficult. In response to this difficulty, data mining (or machine learning) techniques have been introduced and applied to financial prediction. Most studies have focused on accurate forecasting of the value of the stock price. However, different investors adopt different trading strategies, so a forecasting model based on minimizing the error between the actual values and the forecasts may not suit all of them; instead, accurate prediction of the movement direction of a stock index is crucial for making effective market trading strategies. Some recent studies have suggested that trading strategies guided by forecasts of the direction of stock price change may be more effective and generate higher profit. Specifically, investors could effectively hedge against potential market risk, and speculators as well as arbitrageurs could profit by trading a stock index whenever they can obtain an accurate prediction of the direction of the stock price. That is why a number of studies have looked at the direction or trend of movement of various kinds of financial instruments (such as Wu & Zhang, 1997; O'Connor et al., 1997). But these studies do not use data mining based classification techniques.

Data mining techniques have been introduced for the prediction of the movement sign of stock market indices since the results of Leung et al. (2000) and Chen et al. (2001), where LDA, Logit, Probit and neural network models were proposed and compared with a parametric model, the GMM-Kalman filter. Kim (2003) applied new and powerful data mining techniques, SVM and neural network, to forecast the direction of a stock index price based on economic indicators. To obtain more profit from the stock market, more and more of the best forecasting techniques are used by different traders. Instead of a single method, traders need various forecasting techniques to gain multiple signals and more information about the future of the markets. Kumar & Thenmozhi (2006) compared five different approaches, including SVM, Random forest, neural network, Logit and LDA, to predict Indian stock index movement based on economic variable indicators. In that comparison, the SVM outperformed the others in forecasting the direction of the S&P CNX NIFTY index, as the model does not require any prior assumptions on data properties and its algorithm yields a global optimal solution which is unique. Huang et al. (2005) also forecasted the movement direction of the Japanese stock market (NIKKEI 225 index) by various techniques, namely SVM, LDA, QDA, NN, and an all-in-one combined approach. The SVM approach again gave better predictive capability than the other models (LDA, QDA and NN), second only to the out-performance of the combined model. In their study, they defined the movement of the NIKKEI 225 index based on two main factors: the American stock market, via the S&P 500 index, which has the most influence on the world stock markets including the Japanese market, and the currency exchange rate between the Japanese yen and the US dollar.

In our study, ten different data mining techniques are discussed and applied to predict the price movement of the Hang Seng index of the Hong Kong stock market. The approaches include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), K-nearest neighbor classification, Naïve Bayes based on kernel estimation, the Logit model, tree based classification, neural network, Bayesian classification with Gaussian process, support vector machine (SVM) and least squares support vector machine (LS-SVM). The main goal is to explore the predictive ability of the ten data-mining techniques in forecasting the movement direction of the Hang Seng Index based on five factors: its open price, high price, low price, the S&P 500 index, and the currency exchange rate between the HK dollar and the US dollar. The general model of stock price movement is defined as

$D_t = f(O_t, H_t, L_t, SP500_t, FX_t)$,

where $D_t$ is the direction of HSI movement at time $t$, defined as the categorical value 1 if the closing price at time $t$ is greater than the closing price at time $t-1$, and as 0 otherwise. The function $f(\cdot)$ can be linear or nonlinear, and it is estimated by the ten data-mining algorithms. $O_t$ denotes the open price of HSI at time $t$; $H_t$ is the high price of HSI at time $t$; $L_t$ is the low price in a day of HSI at time $t$; $SP500_t$ is the closing price of the S&P 500 index at time $t$; and $FX_t$ is the currency exchange rate between the HK dollar and the US dollar. All the inputs are transformed into log returns to remove any trend pattern.

The remainder of the paper is organized as follows. The next section describes the data and the prediction evaluation. Section 3 briefly discusses the ten different algorithms. The final section concludes.

2. Data description

We examine the daily change of closing prices of the Hang Seng index based on five predictors: open price, high price, low price, the S&P 500 index price, and the exchange rate of USD against HKD. The stock prices are downloaded from Yahoo Finance and the foreign exchange rate is taken from the website of the Federal Reserve Bank of St. Louis.
The sample period is from Jan 03, 2000 to Dec 29, 2006, so the whole sample consists of 1732 trading days. The data is divided into two sub-samples: the in-sample or training data spans from Jan 03, 2000 to Dec 30, 2005, with 1482 trading days, while the whole year 2006, from Jan 2006 to Dec 29, 2006, of size 250 trading days, is reserved as the out-of-sample or test data. Figure 1 displays the actual movement of HSI closing prices for the whole sample. Figure 2 plots the S&P 500 price and its log return, and Figure 3 shows the plots of the price and log return of the exchange rate of HKD against USD. To measure the predictive performance of the different models, the hit rate and error rate are employed, defined as

Hit rate $= \frac{1}{m}\sum_{i=1}^{m} I[A_i = P_i]$ and Error rate $= \frac{1}{m}\sum_{i=1}^{m} I[A_i \neq P_i]$,

where $A_i$ is the actual output for the $i$-th trading day and $P_i$ is the predicted value for the $i$-th trading day, obtained from each model. Here $m$ is the size of the out-of-sample set. R software with related packages is used to conduct the whole experiment; for example, Karatzoglou et al. (2004, 2006) illustrate the R commands for SVM, LS-SVM, and Bayesian classification with Gaussian processes.
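To make the two criteria concrete, here is a minimal R sketch (R being the software named above) that computes them; the price series, prediction vectors and variable names are hypothetical placeholders, not the paper's HSI data.

```r
# Hit rate and error rate on 0/1 direction vectors; the data below are
# hypothetical placeholders, not the paper's HSI series.
hit_rate   <- function(actual, pred) mean(actual == pred)
error_rate <- function(actual, pred) mean(actual != pred)

# Directions as defined in Section 1: D_t = 1 if close_t > close_{t-1}
close_px <- c(100, 102, 101, 103, 104)
D <- as.integer(diff(log(close_px)) > 0)   # gives 1 0 1 1

actual <- c(1, 0, 1, 1, 0, 1, 0, 1)
pred   <- c(1, 0, 0, 1, 0, 1, 1, 1)
hit_rate(actual, pred)                     # 0.75
error_rate(actual, pred)                   # 0.25
```

Since the two rates sum to one by construction, either one alone orders the models; the paper reports both for convenience.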

3. Data-mining methods

Let the training data be $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $X = (X_1, \ldots, X_p)$ denotes a real-valued random input vector and the response $Y$ is categorical, i.e. $Y \in \{1, 2, \ldots, K\}$. The goal is to form a predictor $G(x)$ to predict $Y$ based on $X$; thus $G(x)$ divides the input space (or feature vector space) into a collection of regions, each labeled by one class. For a binary class problem, $Y \in \{1, 2\}$, the decision boundary between the two classes is a hyperplane in the feature space. A hyperplane in the $p$-dimensional input space is the set

$\{x : \beta_0 + \sum_{j=1}^{p} \beta_j x_j = 0\}$. (1)

The two regions separated by the hyperplane are $\{x : \beta_0 + \sum_{j=1}^{p} \beta_j x_j > 0\}$ and $\{x : \beta_0 + \sum_{j=1}^{p} \beta_j x_j < 0\}$.

Now we define the Bayes classification rule. Suppose the training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ are independent samples from the joint distribution of $X$ and $Y$, $f_{X,Y}(x, y) = p_Y(y) f_{X|Y}(x \mid Y = y)$, and the loss function of classifying $Y$ as $G(X) = \hat{Y}$ is $L(\hat{Y}, Y)$, where the marginal distribution of $Y$ is specified by the pmf $p_Y(y)$ and $f_{X|Y}(x \mid Y = y)$ is the conditional distribution of $X$ given $Y = y$. The goal of classification is to minimize the expected loss $E_{X,Y} L(G(X), Y) = E_X[E_{Y|X} L(G(X), Y)]$. To minimize this, it suffices to minimize $E_{Y|X} L(G(X), Y)$ for each $X$; hence the optimal classifier is $G(x) = \arg\min_y E_{Y|X=x} L(y, Y)$. For the 0-1 loss function, $L(y', y) = 1$ for $y' \neq y$ and 0 otherwise, we have $E_{Y|X=x} L(y, Y) = 1 - \Pr(Y = y \mid X = x)$. Therefore the classification rule, called the Bayes rule, becomes the rule of maximum a posteriori probability:

$G(x) = \arg\min_y E_{Y|X=x} L(y, Y) = \arg\max_y \Pr(Y = y \mid X = x)$.

We consider ten algorithms for classification, some of which attempt to estimate $\Pr(Y = y \mid X = x)$ and then apply the Bayes rule $G(x) = \arg\max_y \Pr(Y = y \mid X = x)$; these include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Naïve Bayes based on kernel estimation, and the Logit model. The other techniques are K-nearest neighbor classification, tree based classification, neural network, Bayesian classification with Gaussian process, support vector machine (SVM) and least squares support vector machine (LS-SVM). The last three models take further advantage of kernel based methods.

3.1 Linear Discriminant Analysis

The goal here is to obtain the class posteriors $\Pr(Y \mid X)$ for optimal classification. Suppose $f_k(x)$ is the class-conditional density of $X$ in class $Y = k$, and let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$. By Bayes' theorem,

$\Pr(Y = k \mid X = x) = f_k(x)\pi_k \big/ \sum_{l=1}^{K} f_l(x)\pi_l$.

Suppose each class density is Gaussian,

$f_k(x) = (2\pi)^{-p/2}\,|\Sigma_k|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\right)$,

and the classes are assumed to have a common covariance matrix $\Sigma_k = \Sigma$. Considering the log ratio of the posteriors of two classes $k$ and $l$,

$\log\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = l \mid X = x)} = \log\frac{f_k(x)}{f_l(x)} + \log\frac{\pi_k}{\pi_l} = \log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)^T\Sigma^{-1}(\mu_k - \mu_l) + x^T\Sigma^{-1}(\mu_k - \mu_l)$,

which is a linear equation in $x$, defining a hyperplane in the $p$-dimensional space as in (1). The linear discriminant functions are

$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$,

and the LDA classifier is $G(x) = \arg\max_k \delta_k(x)$. From the training data, we estimate the Gaussian distribution parameters as $\hat{\pi}_k = n_k/n$, where $n_k$ is the number of class-$k$ observations; $\hat{\mu}_k = \sum_{y_i = k} x_i / n_k$; and $\hat{\Sigma} = \sum_{k=1}^{K}\sum_{y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T/(n - K)$. For two classes $\{1, 2\}$, the LDA rule classifies to class 2 if

$x^T\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \tfrac{1}{2}\hat{\mu}_2^T\hat{\Sigma}^{-1}\hat{\mu}_2 - \tfrac{1}{2}\hat{\mu}_1^T\hat{\Sigma}^{-1}\hat{\mu}_1 + \log(\hat{\pi}_1/\hat{\pi}_2)$ (2)

and to class 1 otherwise.

From the experimental result, we have $X = (X_1, \ldots, X_5)$ with dimension $p = 5$, so five coefficients of the linear discriminant in (2) are obtained, as follows: V1 = -0.797685, V2 = 0.96594, V3 = 0.7465535, V4 = -0.0439, V5 = -.85409. The prior probabilities of groups 0 and 1 are 0.503737 and 0.496863 respectively.
For the in-sample data, the hit rate is 0.8393 and the error rate is 0.1607, while for the out-of-sample data, the hit rate and error rate are 0.8440 and 0.1560 respectively.
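A minimal sketch of such an LDA fit in R uses MASS::lda. The synthetic `train`/`test` frames below are stand-ins for the paper's HSI log-return data (the column names and generating process are our assumptions, and the paper's sample sizes of 1482 and 250 are reused); the later sketches in this section assume these same frames.

```r
library(MASS)
set.seed(1)

# Hypothetical stand-in for the paper's data: five log-return predictors
# and a 0/1 direction, with the paper's sample sizes (1482 train, 250 test).
make_toy <- function(n) {
  d <- data.frame(O = rnorm(n), H = rnorm(n), L = rnorm(n),
                  SP500 = rnorm(n), FX = rnorm(n))
  d$Dir <- factor(as.integer(d$H - d$L + 0.5 * d$SP500 + rnorm(n) > 0))
  d
}
train <- make_toy(1482); test <- make_toy(250)

fit.lda <- lda(Dir ~ O + H + L + SP500 + FX, data = train)
fit.lda$prior     # estimated class priors (cf. the group priors quoted above)
fit.lda$scaling   # the five linear discriminant coefficients (cf. V1..V5)

mean(predict(fit.lda, train)$class == train$Dir)  # in-sample hit rate
mean(predict(fit.lda, test)$class  == test$Dir)   # out-of-sample hit rate
```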

3.2 Quadratic discriminant analysis

For QDA, the covariance matrices $\Sigma_k$ above are not assumed equal across the classes $k$; then

$\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$.

The decision boundary between two classes $k$ and $l$ is the quadratic equation $\{x : \delta_k(x) = \delta_l(x)\}$. The estimates for QDA are similar to those for LDA, except that a separate covariance matrix must be estimated for each class; QDA therefore needs $(K-1)\{p(p+3)/2 + 1\}$ parameters. See McLachlan (1992) and Duda et al. (2000) for comprehensive discussions of discriminant analysis. From the prediction results, the hit rate and error rate for the training data by QDA are 0.8305 and 0.1695 respectively. For the test data, the hit rate and error rate are 0.8480 and 0.1520 respectively.

3.3 K-nearest neighbor method

The K-nearest neighbor method is one of the simplest machine learning algorithms for classifying objects based on the closest training examples in the feature space. An object is classified by a majority vote, being assigned to the class most common amongst its $k$ nearest neighbors. Formally, the $k$-nearest neighbor approach uses the training observations $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ closest in input space to $x$ to form $\hat{Y}$. Specifically, the $k$-nearest neighbor fit for $\hat{Y}$ is defined as

$\hat{y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$,

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample. That is, we find the $k$ observations with $x_i$ closest to $x$ in input space and average their responses; $k$ is estimated by a cross-validation technique. The algorithm starts with the determination of the optimal $k$ based on RMSE, done by cross-validation, and then calculates the distances between the query point and all the training samples. After sorting the distances and determining the nearest neighbors based on the $k$-th minimum distance, it gathers the categories $Y$ of the nearest neighbors and uses a simple majority of these categories as the prediction for the query point. Noticeably, the $k$-nearest neighbor approach does not rely on prior probabilities, unlike LDA and QDA.

Results: Table 1 displays the process of choosing $k$ by the cross-validation technique in the experiment. We consider $k = 1, \ldots, 30$ as the initial range and then select the optimal $k$ in that range. The best $k = 10$ is obtained, corresponding to the smallest error (0.1708). [Insert Table 1 around here.] The performance results are as follows: for the training data, the hit rate is 0.8312 and the error rate is 0.1688, and for the test data, the hit rate is 0.7960 and the error rate is 0.2040.

3.4 Naïve Bayes classification method

This is a well established Bayesian method primarily formulated for performing classification tasks. Given its simplicity, i.e., the assumption that the input variables are statistically independent, Naïve Bayes models are effective classification tools that are easy to use and interpret. Naïve Bayes is particularly appropriate when the dimensionality of the feature space (i.e., the number of input variables) is high (a problem known as the curse of dimensionality). Mathematically, the Naïve Bayes model requires the assumption that, given a class $Y = k$, the features $X_j$ are independent, so that $f_k(X) = \prod_{j=1}^{p} f_{kj}(X_j)$. The estimates of $f_{kj}(\cdot)$ are obtained from the training data via kernel smoothing. The Naïve Bayes classification is $G(x) = \arg\max_k \pi_k f_k(x)$, where $\pi_k$ is estimated by the sample proportions. See Mitchell (2005) for a precise explanation. From the result obtained in the experiment, the hit rate and error rate for the in-sample data are 0.8386 and 0.1614 respectively. For the out-of-sample data, the hit rate is 0.8280 and the error rate is 0.1720.
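The classifiers of Sections 3.2-3.4 all have standard R implementations; one plausible sketch of the workflow is given below, assuming the synthetic `train`/`test` frames from the LDA sketch above. MASS::qda fits QDA, class::knn.cv gives leave-one-out cross-validation over $k$, and klaR::NaiveBayes with usekernel = TRUE uses kernel density estimates for the class-conditional densities, as the paper describes.

```r
library(MASS); library(class); library(klaR)
# Assumes `train` and `test` as constructed in the LDA sketch above.

# -- QDA: a separate covariance matrix per class ----------------------------
fit.qda <- qda(Dir ~ ., data = train)
mean(predict(fit.qda, test)$class == test$Dir)    # out-of-sample hit rate

# -- k-NN: choose k in 1..30 by leave-one-out CV on the training set --------
X.tr <- train[, c("O", "H", "L", "SP500", "FX")]
X.te <- test[,  c("O", "H", "L", "SP500", "FX")]
cv.err <- sapply(1:30, function(k)
  mean(knn.cv(X.tr, cl = train$Dir, k = k) != train$Dir))
k.best <- which.min(cv.err)                       # the paper reports k = 10
mean(knn(X.tr, X.te, cl = train$Dir, k = k.best) == test$Dir)

# -- Naive Bayes with kernel-smoothed class-conditional densities -----------
fit.nb <- NaiveBayes(x = X.tr, grouping = train$Dir, usekernel = TRUE)
mean(predict(fit.nb, X.te)$class == test$Dir)
```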
3.5 Logit model

Logistic regression refers to methods for describing the relationship between a categorical response variable and a set of predictor variables. It can be used to predict a dependent variable on the basis of independents and to determine the percentage of variance in the dependent variable explained by the independents. Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable; in this way, it estimates the probability of a certain event occurring. Note that logistic regression calculates changes in the log odds of the dependent variable, not changes in the dependent variable itself. Following Friedman et al. (2008), the model for logistic regression is given as

$\pi(x) = \Pr(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}$

for the two classes of output $Y$. We obtain the $\beta$'s using the maximum likelihood approach. The logit is given by

$G(x) = \log\frac{\pi(x)}{1 - \pi(x)} = \log\frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = \beta_0 + \beta^T x$.

The curves of $\pi(x)$ are called sigmoid because they are S-shaped and therefore nonlinear. Statisticians have chosen the logistic distribution to model binary data because of its flexibility and interpretability. The minimum of $\pi(x)$ is attained as $\lim_{a \to -\infty} e^a/(1 + e^a) = 0$, and the maximum of $\pi(x)$ is obtained as $\lim_{a \to \infty} e^a/(1 + e^a) = 1$.
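In R this model is fitted by maximum likelihood with glm and the binomial family; a minimal sketch, again on the synthetic frames from the LDA sketch rather than the paper's data:

```r
# Logistic regression via maximum likelihood; assumes `train`/`test`
# from the LDA sketch above.
fit.logit <- glm(Dir ~ O + H + L + SP500 + FX,
                 family = binomial(link = "logit"), data = train)
coef(fit.logit)                        # beta_0 .. beta_5

# pi(x) = Pr(Y = 1 | X = x); classify as 1 when pi(x) > 0.5
p.out <- predict(fit.logit, newdata = test, type = "response")
pred  <- as.integer(p.out > 0.5)
mean(pred == as.integer(as.character(test$Dir)))  # out-of-sample hit rate
```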

From the experiment, the coefficients $\beta$ estimated by maximum likelihood are obtained as $\beta_0$ = 0.003, $\beta_1$ = -.385, $\beta_2$ = .6697, $\beta_3$ = 3.49, $\beta_4$ = 0.0736 and $\beta_5$ = -3.93. The performance results show that the hit rate and error rate for the in-sample data are 0.8474 and 0.1526 respectively, while the hit rate and error rate for the out-of-sample data are 0.8560 and 0.1440 respectively.

3.6 Tree based classification

The classification tree is one of the main techniques used in data mining. Classification trees are used to predict the membership of objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such the available techniques have much in common with those used in the more traditional methods of discriminant analysis, cluster analysis, nonparametric statistics, and nonlinear estimation. The flexibility of classification trees makes them a very attractive analysis option, as they do not require any assumption on the distribution, unlike traditional statistical methods.

Technically, tree-based methods partition the feature space into a set of rectangles and then fit a simple model in each one. Starting from the root of a tree, the feature space containing all examples is split recursively into subsets, usually two at a time. Each split depends on the value of only a unique variable of the input $x$. If $x_j$ is categorical, the split is of the form $x_j \in A$ or $x_j \notin A$, where $A$ is a subset of the values of $x_j$. The goodness of a split is measured by an impurity function defined for each node; the basic idea is to choose a split such that the child nodes are purer than their parent node, and the splitting continues till the end subsets (leaf nodes) are pure, that is, till one class dominates. For an impurity function $\phi$, define the impurity measure

$Q(t) = \phi\big(\Pr(1 \mid t), \Pr(2 \mid t), \ldots, \Pr(K \mid t)\big)$,

where $\Pr(k \mid t)$ is the estimated probability of class $k$ within node $t$. The goodness of a split $s$ for node $t$ is

$\Delta I(s, t) = Q(t) - p_R\,Q(t_R) - p_L\,Q(t_L)$,

where $p_R$ and $p_L$ are the proportions of the samples in node $t$ that go to the right node $t_R$ and the left node $t_L$ respectively. Possible impurity functions are:

a. Entropy: $Q(t) = \sum_{k=1}^{K} \Pr(k \mid t)\log(1/\Pr(k \mid t))$;
b. Gini index: $Q(t) = \sum_{k=1}^{K} \Pr(k \mid t)(1 - \Pr(k \mid t))$;

where $\Pr(k \mid t) = \frac{1}{N(t)}\sum_{x_i \in t} I[y_i = k]$, with $I[\cdot] = 1$ if its argument holds and 0 otherwise, and $N(t)$ is the total number of samples in node $t$. The criterion is to stop splitting a node $t$ when $\max_s \Delta I(s, t) < \beta$, where $\beta$ is a chosen threshold. It is not trivial to choose $\beta$, as it leads to overfitting or underfitting problems for new data prediction. To solve this problem, we use a pruning approach: the idea is to obtain a subtree from an initial large tree. One of the most popular techniques is cost-complexity pruning. Let $T_{max}$ be the initial large tree and let $T$ denote a pruned subtree. Then the cost-complexity measure $C_\alpha(T)$ is defined as

$C_\alpha(T) = R(T) + \alpha|\tilde{T}|$, $T \preceq T_{max}$,

where $|\tilde{T}|$ denotes the number of leaf nodes of $T$, $R(T) = \sum_{t \in \tilde{T}} \Pr(t)\,r(t)$ is the error measure of $T$, $r(t) = 1 - \max_k \Pr(k \mid t)$, and $\alpha \geq 0$ is the complexity parameter. Here $C_\alpha(T)$ represents the trade-off between the cost of a tree and its complexity. The goal of cost-complexity pruning is, for each $\alpha$, to choose a tree $T(\alpha) \preceq T_{max}$ such that $C_\alpha(T)$ is minimized. The estimation of $\alpha$ is achieved by cross-validation: we choose the $\hat{\alpha}$ that minimizes the cross-validated error, and the final tree is $T(\hat{\alpha})$. The tree based method is considered one of the top ten algorithms of data mining (Wu et al., 2008). We refer to Quinlan (1986) and Friedman et al. (2008) for detailed discussion of tree based classification; for an illustrative application of the tree method to predicting stock price behavior, see Pearson (2004).
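This grow-then-prune procedure corresponds closely to what rpart does in R: its printcp output is a table of CP, number of splits and cross-validated error of the same form as Table 2. A minimal sketch, on the synthetic frames from the LDA sketch rather than the paper's data:

```r
library(rpart)
# Grow a deliberately large tree, then prune by cost-complexity;
# assumes `train`/`test` from the LDA sketch above.
fit.tree <- rpart(Dir ~ ., data = train, method = "class",
                  cp = 0.0001, xval = 10)   # 10-fold CV for the xerror column
printcp(fit.tree)                           # cf. Table 2

# Pick the CP with the smallest cross-validated error, then prune
cp.best <- fit.tree$cptable[which.min(fit.tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit.tree, cp = cp.best)

mean(predict(pruned, train, type = "class") == train$Dir)  # in-sample hit
mean(predict(pruned, test,  type = "class") == test$Dir)   # out-of-sample hit
```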
Experimental results: Denote the variables actually used in the tree construction by V1 = Open price, V2 = Low price, V3 = High price, V4 = S&P500, V5 = FX, V6 = Close price. The class is 1 when the next price is larger than the previous price, while the class 0 refers to the case when the next price is smaller than the previous price. Table 2 shows the process of pruning the tree. From the table, we choose CP = 0.00368, corresponding to 17 splits, based on the smallest value of the x_error (0.407). The initial large tree is not displayed, to save space, but the pruned tree is illustrated in the appendix section. The performance results show that the hit rate is 0.8717 and the error rate is 0.1283 for the training data, while for the test data the hit rate is 0.8000 and the error rate is 0.2000.

3.7 Neural network for classification

Haykin (1994) defines a neural network as a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: (1) knowledge is acquired by the network through a learning process, and (2) interneuron connection strengths known as synaptic weights are used to store the knowledge. The literature on neural networks is enormous, and its applications spread over many scientific areas; see Bishop (1995) and Ripley (1996) for details.

Recently, the neural network has become well known for its good capability in forecasting stock market movement. Let us go directly to its formulation. Let $X$ be the input vector and $Y$ the output taking categorical values. Following the notation in Hastie (1996), the neural network model can be represented as

$z_j = \sigma(\alpha_{0j} + \alpha_j^T x)$, $j = 1, \ldots, m$;
$\hat{y}_k = f(\beta_{0k} + \beta_k^T z)$, $k = 1, \ldots, q$;

where $\sigma(v) = 1/(1 + e^{-v})$ is the activation function, called the sigmoid. The parameters $\alpha_{jl}$ and $\beta_{kj}$ are known as weights, and $\alpha_{0j}$ and $\beta_{0k}$ are biases. Here we use $f(v) = 1/(1 + e^{-v})$, the inverse logit, for binary classification. To learn the neural network, back propagation is used. Suppose we use least squares on a sample of training data to learn the weights, with error function

$R(\alpha, \beta) = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

The derivatives for the $i$-th component are

$\frac{\partial R_i}{\partial \beta_j} = -2(y_i - \hat{y}_i)\,f'(\beta^T z_i)\,z_{ij}$,
$\frac{\partial R_i}{\partial \alpha_{jl}} = -2(y_i - \hat{y}_i)\,f'(\beta^T z_i)\,\beta_j\,\sigma'(\alpha_j^T x_i)\,x_{il}$,

and the gradient update at the $(r+1)$-st iteration is

$\beta_j^{(r+1)} = \beta_j^{(r)} - \gamma_r \sum_{i=1}^{n}\frac{\partial R_i}{\partial \beta_j^{(r)}}$, $\alpha_{jl}^{(r+1)} = \alpha_{jl}^{(r)} - \gamma_r \sum_{i=1}^{n}\frac{\partial R_i}{\partial \alpha_{jl}^{(r)}}$,

where $\gamma_r$ is the learning rate. In the experimental analysis, $l = 1, \ldots, 5$, $j = 1, 2$ and $k = 1, 2$, so a 5-2-2 network with 18 weights is obtained, with estimated weights 9.33, .57, 6.4, -3.06, 4.40, .70 for the first hidden unit; -0.9, -0.84, 0.86, 0.87, 0.09, -.5 for the second hidden unit; and 0.59, 3.99, -5.8 and -.57, -3.99, 5.80 for the two output units. From the prediction performance, the hit rate is 0.8481 and the error rate is 0.1519 for the in-sample data; for the out-of-sample data, 0.8520 and 0.1480 are the hit rate and error rate respectively.
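A single-hidden-layer network of this kind can be fitted in R with nnet. One caveat: for a two-level factor, nnet fits a single logistic output unit (a 5-2-1 net with 15 weights) and optimizes by BFGS rather than plain back propagation, so this sketch approximates, rather than reproduces, the fit reported above. Synthetic frames as before.

```r
library(nnet)
set.seed(7)
# Single hidden layer with 2 sigmoid units, logistic output; assumes
# `train`/`test` from the LDA sketch. `decay` adds weight regularization.
fit.nn <- nnet(Dir ~ ., data = train, size = 2, decay = 0.01,
               maxit = 500, trace = FALSE)
summary(fit.nn)   # lists the fitted weights (the alpha's and beta's)

mean(predict(fit.nn, test, type = "class") == test$Dir)  # out-of-sample hit
```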

3.8 Bayesian classification with Gaussian process

Gaussian processes are based on the prior assumption that adjacent observations should convey information about each other. In particular, it is assumed that the observed variables are normally distributed and that the coupling between them is captured by the covariance matrix of a normal distribution. Using the kernel matrix as the covariance matrix is a convenient way of extending Bayesian modeling of linear estimators to nonlinear situations. In a regression problem, the goal is to predict a real-valued output based on a set of input variables, and it is possible to carry out nonparametric regression using a Gaussian process: with a Gaussian prior and a Gaussian noise model, the solution of the regression problem can be obtained via a kernel function placed on each training point, with the coefficients determined by solving a linear system. If the parameters indexing the Gaussian process are unknown, Bayesian inference can be carried out for them.

Gaussian processes can be extended to classification problems by defining a Gaussian process over $y$, the input to a sigmoid function. The goal is to predict $\Pr(Y = k \mid X = x)$, $k = 1, \ldots, K$. For the binary case, $k \in \{0, 1\}$, $\Pr(Y = 1 \mid X = x)$ is estimated by $\sigma(y(x))$, where $\sigma(y) = 1/(1 + e^{-y})$. The idea is to place a Gaussian prior on $y(x)$ and combine it with the training data $D = \{(x_i, t_i),\ i = 1, \ldots, n\}$ to obtain predictions at new points $x_*$. A Bayesian treatment is imposed by integrating over the uncertainty in $y$ and in the parameters that control the Gaussian prior; then the Laplace approximation is employed to obtain the result of the integration over $y$. Specifically, let $\Pr(y)$ be the prior of $y = (y(x_1), \ldots, y(x_n))$, so that $\Pr(y_*, y)$ is the joint distribution including the new point. Given a new input $x_*$, we want to predict $y_* = y(x_*)$ based on $D$. Let $\Pr(t \mid y)$ be the probability of observing the particular values $t = (t_1, \ldots, t_n)$ given the actual values $y$ (i.e., the noise model). Then we have

$\Pr(y_* \mid t) = \int \Pr(y_*, y \mid t)\,dy = \frac{1}{\Pr(t)}\int \Pr(y_*, y)\,\Pr(t \mid y)\,dy = \int \Pr(y_* \mid y)\,\Pr(y \mid t)\,dy$.

Hence the predictive distribution for $y_*$ is found from the marginalization of the product of the prior and the noise model. The integral terms are estimated by Laplace's approximation. Williams and Barber (1999) and Rasmussen and Williams (2006) provide a comprehensive and detailed discussion of Bayesian classification with Gaussian processes. The following briefly describes the experimental results obtained from training and forecasting the movement of the HSI by Bayesian classification with Gaussian processes.

Problem type: classification
Gaussian radial basis kernel function, hyperparameter: sigma = 0.4496784066
Number of training instances learned: 1482
Training error: 0.1387; cross-validation error: 0.1648
In-sample data: the hit rate is 0.8595 and the error rate is 0.1405.
Out-of-sample data: the hit rate is 0.8520 and the error rate is 0.1480.
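kernlab's gausspr function, documented in the Karatzoglou et al. references cited earlier, provides a Gaussian process classifier with an RBF kernel whose printed output reports the same kind of quantities as above (kernel hyperparameter, training error, cross-validation error). A minimal sketch on the synthetic frames, not the paper's data:

```r
library(kernlab)
# GP classification with an RBF kernel; sigma is chosen automatically
# unless supplied via kpar. Assumes `train`/`test` from the LDA sketch.
fit.gp <- gausspr(Dir ~ ., data = train, kernel = "rbfdot", cross = 10)
fit.gp                                   # echoes sigma, training and CV error

mean(predict(fit.gp, test) == test$Dir)  # out-of-sample hit rate
```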

3.9 Support vector machine for classification

A popular machine learning algorithm of the neural network type is the support vector machine (SVM), developed by Vapnik (1995). The SVM is a kernel based learning approach, like the Gaussian process above, but the SVM does not require any assumptions on the data properties as the Gaussian process does. The SVM has been successfully applied in various areas of prediction, for instance in financial time series forecasting (Mukherjee et al., 1997; Tay and Cao, 2001), marketing (Ben-David and Lindenbaum, 1997), estimating manufacturing yields (Stoneking, 1999), text categorization (Joachims, 2002), face detection in images (Osuna et al., 1997), handwritten digit recognition (Burges and Schölkopf, 1997; Cortes and Vapnik, 1995) and medical diagnosis (Tarassenko et al., 1995).

The SVM formulation can be stated as follows. Given a training set $(x_i, y_i)$, $i = 1, \ldots, N$, with input data $x_i \in R^n$ and corresponding binary class labels $y_i \in \{-1, 1\}$, the SVM algorithm seeks the separating hyperplane with the largest margin. The problem can be formulated as

$\min_{w,b}\ \tfrac{1}{2}w^T w$ (3)

subject to $y_i(w^T x_i + b) \geq 1$, $i = 1, \ldots, N$. (4)

The standard method to solve (3)-(4) is convex programming, where the Lagrange method is applied to transform the primal problem into its dual. Specifically, we first construct the Lagrangian

$L_P = \tfrac{1}{2}w^T w - \sum_{i=1}^{N}\alpha_i\big[y_i(w^T x_i + b) - 1\big]$, (5)

where the $\alpha_i \geq 0$ are nonnegative Lagrange multipliers corresponding to (4). The solution is achieved at a saddle point of the Lagrangian, which has to be minimized with respect to $w$ and $b$ and maximized with respect to $\alpha$. Differentiating (5) and setting the results equal to zero,

$\partial L_P/\partial w = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0$, (6)
$\partial L_P/\partial b = -\sum_{i=1}^{N}\alpha_i y_i = 0$. (7)

The optimal solution is obtained from (6) as

$w^* = \sum_{i=1}^{N}\alpha_i^* y_i x_i$, (8)

where $*$ denotes optimal values. Now, substituting (8) and (7) into (5),

$L_D = \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,x_i^T x_j$, (9)

and the dual problem is posed as a quadratic program: maximize $L_D$ subject to $\sum_{i=1}^{N}\alpha_i y_i = 0$ and $\alpha_i \geq 0$, $i = 1, \ldots, N$. The conditions

$\alpha_i^*\big[y_i(w^{*T}x_i + b^*) - 1\big] = 0$, $i = 1, \ldots, N$, (10)

imply that $\alpha_i^* > 0$ only when constraint (4) is active. The vectors for which $\alpha_i^* > 0$ are called support vectors. By (10), we obtain $b^* = y_i - w^{*T}x_i$ for any support vector $x_i$. By the linearity of the inner product and (8), the decision function for the linearly separable case is

$f(x) = \mathrm{sign}(w^{*T}x + b^*) = \mathrm{sign}\Big(\sum_{i=1}^{N}\alpha_i^* y_i\,x^T x_i + b^*\Big)$.

For the linearly non-separable case, we introduce a new set of variables $\xi_i$, $i = 1, \ldots, N$, that measure the amount of violation of the constraints. Thus (3) and (4) are modified as

$\min_{w,b,\xi}\ \tfrac{1}{2}w^T w + C\sum_{i=1}^{N}\xi_i$ (11)

subject to $y_i(w^T x_i + b) \geq 1 - \xi_i$, $i = 1, \ldots, N$, (12)
$\xi_i \geq 0$, $i = 1, \ldots, N$, (13)

where the predetermined parameter $C$ defines the cost of constraint violation. The Lagrangian is constructed as

$L_P = \tfrac{1}{2}w^T w + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\big[y_i(w^T x_i + b) - 1 + \xi_i\big] - \sum_{i=1}^{N}\mu_i\xi_i$, (14)

where the $\alpha_i$ and $\mu_i$ are Lagrange multipliers associated with constraints (12) and (13) respectively. The solution to this problem is determined by minimizing $L_P$ with respect to $w$, $\xi$ and $b$ and maximizing it with respect to $\alpha$ and $\mu$. Differentiating (14) and setting the derivatives equal to zero,

$\partial L_P/\partial w = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0$, (15)
$\partial L_P/\partial b = -\sum_{i=1}^{N}\alpha_i y_i = 0$, (16)
$\partial L_P/\partial \xi_i = C - \alpha_i - \mu_i = 0$, $i = 1, \ldots, N$. (17)

From (15), $w^* = \sum_{i=1}^{N}\alpha_i^* y_i x_i$. (18)

Substituting (18), (17) and (16) into (14), we again obtain $L_D = \sum_{i}\alpha_i - \tfrac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j y_i y_j\,x_i^T x_j$. This leads to the dual problem: maximize $L_D$

subject to $\sum_{i=1}^{N}\alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $i = 1, \ldots, N$, where the upper bound follows from (17) since $\mu_i \geq 0$. Hence the classifier is

$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{N}\alpha_i^* y_i\,x^T x_i + b^*\Big)$,

where $b^* = y_i - w^{*T}x_i$ for any support vector $x_i$ such that $0 < \alpha_i^* < C$, following from $\alpha_i^*\big[y_i(w^{*T}x_i + b^*) - 1 + \xi_i^*\big] = 0$, $i = 1, \ldots, N$.

For a nonlinear classifier, we map the input variable $x$ into a higher dimensional feature space and work with linear classification in that space, i.e., $x \mapsto \phi(x) = (a_1\phi_1(x), a_2\phi_2(x), \ldots, a_n\phi_n(x), \ldots)$, where the $a_n$ are some real numbers and the $\phi_n$ are some real functions. The solution of the SVM has the form

$f(x) = \mathrm{sign}\big(w^{*T}\phi(x) + b^*\big) = \mathrm{sign}\Big(\sum_{i=1}^{N}\alpha_i^* y_i\,\phi(x)^T\phi(x_i) + b^*\Big)$.

To avoid the complex calculation of the scalar product $\phi(x)^T\phi(x_i)$, we introduce the kernel function

$K(x, y) = \phi(x)^T\phi(y) = \sum_n a_n^2\,\phi_n(x)\phi_n(y)$,

which satisfies Mercer's condition. Hence

$f(x) = \mathrm{sign}\Big(\sum_{i=1}^{N}\alpha_i^* y_i\,K(x, x_i) + b^*\Big)$,

where the $\alpha_i^*$ and $b^*$ are some real numbers. In this work, the Gaussian kernel or RBF (radial basis function) is used, as it tends to give good performance under general smoothness assumptions; the kernel is defined as $K(x, x_i) = \exp(-\gamma\|x - x_i\|^2)$. The kernel and regularization parameters $(C, \gamma)$, each with range $[2^{-5}, 2^{5}]$, are tuned by the grid-search technique to avoid the overfitting problem. By applying the above SVM algorithm to train the data under study, the training results are obtained. Table 3 illustrates the cross-validation error corresponding to the tuning parameters $(C, \gamma)$. From Table 3, we obtain the best hyperparameters $C = 2^{4}$ and $\gamma = 2^{-4}$ (the Gaussian kernel function parameter), with the smallest ten-fold cross-validation error of 0.000006. Considering the in-sample data, the hit rate is 1.0000 and the error rate is 0.0000, but for the out-of-sample data, the hit rate is 0.8600 and the error rate is 0.1400.
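One plausible reconstruction of this tuning loop uses kernlab::ksvm: both $C$ and the RBF parameter (called sigma in kernlab, playing the role of $\gamma$ here) range over powers of two, and the ten-fold cross-validation error selects the pair, as in Table 3. The grid bounds mirror the ranges quoted above; the data are the synthetic frames from the LDA sketch, so the numbers will not match the paper's.

```r
library(kernlab)
# Grid-search (C, gamma) over powers of two by 10-fold CV, as in Table 3.
# Assumes `train`/`test` from the LDA sketch; this loop is slow (121 fits).
grid <- expand.grid(C = 2^(-5:5), sigma = 2^(-5:5))
cv.err <- mapply(function(C, s)
  cross(ksvm(Dir ~ ., data = train, kernel = "rbfdot",
             kpar = list(sigma = s), C = C, cross = 10)),
  grid$C, grid$sigma)
best <- grid[which.min(cv.err), ]

fit.svm <- ksvm(Dir ~ ., data = train, kernel = "rbfdot",
                kpar = list(sigma = best$sigma), C = best$C)
mean(predict(fit.svm, train) == train$Dir)  # in-sample hit rate
mean(predict(fit.svm, test)  == test$Dir)   # out-of-sample hit rate
```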

3.10 Least squares support vector machine for classification

The LS-SVM is a modified version of the SVM introduced by Suykens et al. (1999). The LS-SVM uses a least squares loss function to obtain a set of linear equations in the dual space, so that learning is faster and the complexity of the convex programming calculation (as in the SVM) is relaxed. In addition, the LS-SVM avoids the drawback faced by the SVM of selecting the trade-off parameters; instead, the LS-SVM requires only two hyper-parameters $(\gamma, \sigma^2)$ to train the model. According to Suykens et al. (2001), the equality constraints of the LS-SVM allow it to act as a recurrent neural network and in nonlinear optimal control. Due to these nice properties, the LS-SVM has been successfully applied to classification and regression problems, including time series forecasting. See Van Gestel et al. (2004) for a detailed discussion of the classification performance of the LS-SVM, and Van Gestel et al. (2001) and Ye et al. (2004) for the predictive capability of the LS-SVM in chaotic time series prediction.

Formally, given a training set $(x_i, y_i)$, $i = 1, \ldots, N$, with input data $x_i \in R^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the LS-SVM formulation is

$\min_{w,b,e}\ J(w, e) = \tfrac{1}{2}w^T w + \tfrac{\gamma}{2}\sum_{i=1}^{N}e_i^2$

subject to the equality constraints $y_i\big[w^T\phi(x_i) + b\big] = 1 - e_i$, $i = 1, \ldots, N$. This formulation consists of equality instead of inequality constraints and takes into account a squared error with a regularization term, similar to ridge regression. The solution is obtained after constructing the Lagrangian

$L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^{N}\alpha_i\big\{y_i[w^T\phi(x_i) + b] - 1 + e_i\big\}$,

where the $\alpha_i$ are Lagrange multipliers that can be positive or negative in the LS-SVM formulation. From the conditions for optimality, one obtains the Karush-Kuhn-Tucker (KKT) system

$\partial L/\partial w = 0 \;\Rightarrow\; w = \sum_{i=1}^{N}\alpha_i y_i\,\phi(x_i)$,
$\partial L/\partial b = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0$,
$\partial L/\partial e_i = 0 \;\Rightarrow\; \alpha_i = \gamma e_i$, $i = 1, \ldots, N$,
$\partial L/\partial \alpha_i = 0 \;\Rightarrow\; y_i\big[w^T\phi(x_i) + b\big] - 1 + e_i = 0$, $i = 1, \ldots, N$.

By eliminating $w$ and $e$, one obtains the linear system

$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix}$,

with $y = [y_1, \ldots, y_N]^T$, $1_v = [1, \ldots, 1]^T$, $e = [e_1, \ldots, e_N]^T$ and $\alpha = [\alpha_1, \ldots, \alpha_N]^T$. The matrix with entries $\Omega_{il} = y_i y_l\,\phi(x_i)^T\phi(x_l) = y_i y_l\,K(x_i, x_l)$, $i, l = 1, \ldots, N$, satisfies Mercer's condition, and the LS-SVM classifier is obtained as

$y(x) = \mathrm{sign}\Big(\sum_{i=1}^{N}\alpha_i y_i\,K(x, x_i) + b\Big)$.

In this work, the Gaussian kernel or RBF is again used, as it tends to give good performance under general smoothness assumptions; it is defined as $K(x, x_i) = \exp(-\|x - x_i\|^2/\sigma^2)$. The kernel and regularization parameters $(\gamma, \sigma^2)$ are tuned by the grid-search technique to avoid the overfitting problem. The experimental results show that the obtained hyper-parameters are gamma = 0.5069, chosen from the range [0.04978707, 148.41], and sigma squared = 8.7544, selected from the range [0.08085, .85]. The cost of ten-fold cross-validation is 0.0339. From the forecasting performance, the in-sample hit rate is 0.8528 and the error rate is 0.1472; for the out-of-sample data, the hit rate is 0.8640 and the error rate is 0.1360.
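kernlab also provides lssvm, which solves a linear KKT system of exactly this kind instead of a quadratic program. A caveat: its regularization argument is named tau rather than gamma, and the sketch below leaves it and the kernel parameter at illustrative values rather than the tuned $(\gamma, \sigma^2)$ reported above; the synthetic frames from the LDA sketch stand in for the paper's data.

```r
library(kernlab)
# Least squares SVM: the dual reduces to a linear system in (b, alpha),
# so no quadratic program is solved. Assumes `train`/`test` as before.
# `tau` is kernlab's regularization weight; the value here is illustrative,
# not the paper's tuned gamma.
fit.ls <- lssvm(Dir ~ ., data = train, kernel = "rbfdot", tau = 0.01)

mean(predict(fit.ls, train) == train$Dir)  # in-sample hit rate
mean(predict(fit.ls, test)  == test$Dir)   # out-of-sample hit rate
```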

To sum up, Table 4 gives a summary of the prediction performance of the ten different approaches. From Table 4, we can see that almost all the algorithms generate a high hit rate (more than 80%) and a low error rate (less than 20%). By comparison, the LS-SVM ranks first, as it outperforms the other models even though it is not better than the SVM for in-sample prediction. The superior performance of the SVM models in this study also supports the results in the literature. Bayesian classification with Gaussian process produces predictions as good as the neural network, following the SVM and LS-SVM. The K-nearest neighbor approach gives the worst predictive ability, with tree classification next.

4. Conclusion

In this paper, we apply ten different data mining techniques to forecast the movement direction of the Hang Seng index of the Hong Kong stock market. All algorithms produce good predictions, with hit rates of more than 80%. The LS-SVM and SVM outperform the other models since, theoretically, they do not require any prior assumption on data properties and their algorithms are guaranteed to efficiently obtain the global optimal solution, which is unique. The other models may be reliable for other markets, especially when the data fall into each of their properties. As can be seen from Figures 1-3 in the appendix section, different price series behave differently. Therefore, all the approaches are recommended for forecasters of stock index movement, and the better models, SVM and LS-SVM, are preferred.

References

Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171-182.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Burges, C.J.C., & Schölkopf, B. (1997). Improving the accuracy and speed of support vector machines. Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, pp. 475-481.

Cao, L.J., & Tay, F. (2001). Application of support vector machines in financial time series forecasting. The International Journal of Management Science, pp. 309-317.

Chen, A.-S., Daouk, H., & Leung, M. (2001). Application of neural networks to an emerging financial market: Forecasting and trading the Taiwan Stock Index. [Online] Available: http://ssrn.com/abstract=37038 or DOI: 10.2139/ssrn.37038 (July 2001).

Cortes, C., & Vapnik, V.N. (1995). Support vector networks. Machine Learning, 20, 273-297.

Deng, M. (2006). Pattern and forecast of movement in stock price. [Online] Available: http://ssrn.com/abstract=7048 or DOI: 10.2139/ssrn.7048 (November 8, 2006).

Duda, R., Hart, P., & Stork, D. (2000). Pattern Classification (Second Edition). Wiley, New York.

McLachlan, G.J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning: Data Mining, Inference and Prediction (Second Edition). Springer.

Hastie, T. (1996). Neural network. In Encyclopedia of Biostatistics, John Wiley.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. Macmillan, New York.

Huang, W., Nakamori, Y., & Wang, S.Y. (2005). Forecasting stock market movement direction with support vector machine. Computers & Operations Research, pp. 2513-2522.

Joachims, T. (2002). Learning to Classify Text using Support Vector Machines. Kluwer Academic Publishers, London.

Karatzoglou, A., Meyer, D., & Hornik, K. (2006). Support vector machines in R. Journal of Statistical Software.

Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). An S4 package for kernel methods in R. Journal of Statistical Software.

Kim, K.J. (2003). Financial time series forecasting using support vector machines. Neurocomputing, 55, 307-319.

Kumar, M., & Thenmozhi, M. (2006). Forecasting stock index movement: A comparison of support vector machines and random forest. Indian Institute of Capital Markets 9th Capital Markets Conference Paper. [Online] Available: http://ssrn.com/abstract=876544 (February 06, 2006).

Leung, M., Daouk, H., & Chen, A.-S. (1999). Forecasting stock indices: A comparison of classification and level estimation models. [Online] Available: http://ssrn.com/abstract=0049 (March 1999).

Lo, A.W., & MacKinlay, A.C. (1988). Stock market prices do not follow random walks: Evidence from a simple specification test. Review of Financial Studies, 1, 41-66.

Mitchell, T.M. (2007). Machine Learning. McGraw Hill, USA.

Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. In: Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, Amelia Island, FL, pp. 511-520.

O'Connor, M., Remus, W., & Griggs, K. (1997). Going up-going down: How good are people at forecasting trends and changes in trends? Journal of Forecasting, 16, 165-176.

Osuna, E.E., Freund, R., & Girosi, F. (1997). Support Vector Machines: Training and Applications. MIT Press, USA.

Pearson, R.A. (2004). Tree structures for predicting stock price behaviour. ANZIAM Journal, Australian Mathematical Society, 45, pp. C950-C963.

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Rasmussen, C.E., & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. The MIT Press, ISBN 026218253X.
Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.

Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36, 70-76.

Suykens, J.A.K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293-300.

Suykens, J.A.K., Vandewalle, J., & De Moor, B. (2001). Optimal control by least squares support vector machines. Neural Networks, 14, 23-35.

Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In: Proceedings of the Fourth IEE International Conference on Artificial Neural Networks, Cambridge, pp. 442-447.

Van Gestel, T., Suykens, J.A.K., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., & Vandewalle, J. (2004). Benchmarking least squares support vector machine classifiers. Machine Learning, 54, 5-32.

Van Gestel, T., Suykens, J.A.K., Baestaens, D.E., Lambrechts, A., Lanckriet, G., Vandaele, B., De Moor, B., & Vandewalle, J. (2001). Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 12(4), 809-821.

Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.

Williams, C.K.I., & Barber, D. (1999). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342-1351.

Wu, Y., & Zhang, H. (1997). Forward premiums as unbiased predictors of future currency depreciation: A non-parametric analysis. Journal of International Money and Finance, 16, 609-623.

Wu, X.D., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, Springer-Verlag London.

Ye, M.Y., & Wang, X.D. (2004). Chaotic time series prediction using least squares support vector machines. Chinese Physics, IOP Publishing Ltd.

Appendices:

Table 1. Selection of k by the cross-validation technique

k      1      2      3      4      5      6      7      8      9      10
error  0.30   0.75   0.964  0.03   0.897  0.97   0.789  0.796  0.735  0.1708

k      11     12     13     14     15     16     17     18     19     20
error  0.79   0.79   0.74   0.735  0.769  0.86   0.76   0.83   0.769  0.789

k      21     22     23     24     25     26     27     28     29     30
error  0.749  0.776  0.79   0.769  0.79   0.769  0.776  0.76   0.749  0.749

The table displays the cross-validated error (RMSE) against the corresponding value of k. The optimal k is 10, with the smallest error, 0.1708.

Table 2. Process of pruning the tree

    CP          nsplit   rel error   x_error    x_std
1   0.5387755   0        1.00000     1.0769     0.0678
2   0.08574              0.46        0.47       0.077
3   0.049660    3        0.40408     0.43673    0.0573
4   0.0794      5        0.3745      0.4905     0.050
5   0.0088435   9        0.3653      0.404      0.075
6   0.008633             0.30884     0.4905     0.050
7   0.006807    13       0.95        0.436      0.048
8   0.004769    15       0.789       0.4086     0.0044
9   0.00368     17       0.6939      0.407      0.00938
10  0.007       20       0.5850      0.40408    0.00965
11  0.000408             0.5578      0.40680    0.007
12  0.007007    25       0.476       0.4497     0.073
13  0.003605    29       0.408       0.4497     0.073
14  0.000000    36       0.39        0.4857     0.046

Denote the variables actually used in the tree construction by V1 = Open price, V2 = Low price, V3 = High price, V4 = S&P500, V5 = FX, V6 = Close price. The class is 1 when the next price is larger than the previous price, while the class 0 refers to the case when the next price is smaller than the previous price. We choose CP = 0.00368, corresponding to 17 splits, based on the smallest value of the x_error (0.407), to obtain the pruned tree.

[Pruned tree diagram: a binary classification tree on the transformed inputs, with splits such as V3 < -0.38, V1 < 0.6049, V2 >= -0.3353, V4 < -0.0667 and V5 >= -0.086, and each leaf labeled with class 0 or 1.]

Table 3. Hyperparameter selection based on 10-fold cross-validation for SVM

gamma\C  2^-5    2^-4    2^-3    2^-2    2^-1    2^0     2^1     2^2     2^3     2^4      2^5
2^5      0.679   0.596   0.53    0.489   0.455   0.487   0.69    0.479   0.3967  0.4647   0.484945
2^4      0.576   0.5     0.449   0.44    0.369   0.358   0.46    0.643   0.96    0.4      0.4490
2^3      0.498   0.44    0.400   0.353   0.304   0.46    0.47    0.345   0.699   0.359    0.38449
2^2      0.439   0.39    0.347   0.305   0.33    0.50    0.084   0.073   0.34    0.73     0.63409
2^1      0.405   0.363   0.39    0.69    0.83    0.069   0.0930  0.0806  0.0676  0.068    0.095686
2^0      0.384   0.34    0.306   0.4     0.4     0.006   0.0830  0.066   0.0373  0.05     0.00404
2^-1     0.364   0.33    0.85    0.00    0.0     0.096   0.0746  0.059   0.038   0.006    0.000574
2^-2     0.35    0.34    0.7     0.70    0.066   0.0874  0.0680  0.0409  0.043   0.008    0.000006
2^-3     0.339   0.96    0.5     0.5     0.007   0.083   0.063   0.0306  0.008   0.00009  0.000006
2^-4     0.330   0.83    0.6     0.36    0.095   0.0776  0.0536  0.00    0.0030  0.000006 0.000006
2^-5     0.3     0.79    0.06    0.05    0.0896  0.079   0.0443  0.06    0.0004  0.000006 0.000006

The table illustrates the ten-fold cross-validation error corresponding to the tuning parameters (C, gamma). The training result shows that (C, gamma) = (2^4, 2^-4) corresponds to the smallest cross-validation error, 0.000006.

Table 4. Summary of the prediction performance of the ten different approaches

                      In-sample               Out-of-sample
Predictor             Hit Rate   Error Rate   Hit Rate   Error Rate   Rank
LDA                   0.8393     0.1607       0.8440     0.1560       6
QDA                   0.8305     0.1695       0.8480     0.1520       5
K-NN                  0.8312     0.1688       0.7960     0.2040       9
Naïve Bayes           0.8386     0.1614       0.8280     0.1720       7
Logit model           0.8474     0.1526       0.8560     0.1440       3
Tree classification   0.8717     0.1283       0.8000     0.2000       8
Neural network        0.8481     0.1519       0.8520     0.1480       4
Gaussian process      0.8595     0.1405       0.8520     0.1480       4
SVM                   1.0000     0.0000       0.8600     0.1400       2
LS-SVM                0.8528     0.1472       0.8640     0.1360       1

[Figure 1. Daily closing prices of the Hang Seng Index, Jan 2000 - Dec 2006.]

[Figure 2. Daily closing prices (left) and log returns (right) of the S&P 500 index, Jan 2000 - Dec 2006.]

[Figure 3. Daily prices (left) and log returns (right) of the currency exchange rate between HKD and USD, Jan 2000 - Dec 2006.]