Enhancing Q-Learning for Optimal Asset Allocation

Ralph Neuneier
Siemens AG, Corporate Technology
D-81730 München, Germany
Ralph.Neuneier@mchp.siemens.de

Abstract

This paper enhances the Q-learning algorithm for optimal asset allocation proposed in (Neuneier, 1996 [6]). The new formulation simplifies the approach by using only one value-function for many assets and allows model-free policy-iteration. After testing the new algorithm on real data, the possibility of risk management within the framework of Markov decision problems is analyzed. The proposed methods allow the construction of a multi-period portfolio management system which takes into account transaction costs, the risk preferences of the investor, and several constraints on the allocation.

1 Introduction

Asset allocation and portfolio management deal with the distribution of capital to various investment opportunities like stocks, bonds, foreign exchanges and others. The aim is to construct a portfolio with maximal expected return for a given risk level and time horizon while simultaneously obeying institutional or legally required constraints. To find such an optimal portfolio the investor has to solve a difficult optimization problem consisting of two phases [4]. First, the expected yields together with a certainty measure have to be predicted. Second, based on these estimates, mean-variance techniques are typically applied to find an appropriate fund allocation. The problem is further complicated if the investor wants to revise her/his decision at every time step and if transaction costs for changing the allocations must be considered.

[Figure: the loop between investor and financial market (disturbances, investments, returns, rates and prices), formalized as a Markov Decision Problem: state x_t = ($_t, K_t) with market $_t and portfolio K_t; policy μ with actions a_t = μ(x_t); transition probabilities p(x_{t+1}|x_t, a_t); return function r(x_t, a_t, $_{t+1}).]

Within the framework of Markov Decision Problems, MDPs, the modeling phase and the search for an optimal portfolio can be combined (fig. above).
Furthermore, transaction costs, constraints, and decision revision are naturally integrated. The theory of MDPs formalizes control problems within stochastic environments [1]. If the discrete state space is small and if an accurate model of the system is available, MDPs can be solved by conventional Dynamic Programming, DP. At the other extreme, reinforcement learning methods using a function approximator and stochastic approximation for computing the relevant expectation values can be applied to problems with large (continuous) state spaces and without an appropriate model available [2, 10].

In [6], asset allocation is formalized as an MDP under the following assumptions, which clarify the relationship between MDPs and portfolio optimization:

1. The investor may trade at each time step for an infinite time horizon.
2. The investor is not able to influence the market by her/his trading.
3. There are only two possible assets for investing the capital.
4. The investor has no risk aversion and always invests the total amount.

The reinforcement algorithm Q-Learning, QL, has been tested on the task of investing liquid capital in the German stock market DAX, using neural networks as value function approximators for the Q-values Q(x, a). The resulting allocation strategy generated more profit than a heuristic benchmark policy [6]. Here, a new formulation of the QL algorithm is proposed which allows one to relax the third assumption. Furthermore, in section 3 the possibility of risk control within the MDP framework is analyzed, which relaxes assumption four.

2 Q-Learning with uncontrollable state elements

This section explains how the QL algorithm can be simplified by the introduction of an artificial deterministic transition step. Using real data, the successful application of the new algorithm is demonstrated.

2.1 Q-Learning for asset allocation

The situation of an investor is formalized at time step t by the state vector x_t = ($_t, K_t), which consists of elements $_t describing the financial market (e.g. interest rates, stock indices), and of elements K_t describing the investor's current allocation of the capital (e.g. how much capital is invested in which asset). The investor's decision a_t for a new allocation and the dynamics of the financial market let the state switch to x_{t+1} = ($_{t+1}, K_{t+1}) according to the transition probability p(x_{t+1}|x_t, a_t).
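As a small illustration of this formalization, the state vector and a single transition can be sketched in Python; the market features and helper names below are illustrative assumptions, not part of the original system:

```python
import numpy as np

def make_state(market, allocation):
    """State x_t as a pair: market features $_t and capital allocation K_t."""
    return (np.asarray(market, dtype=float), float(allocation))

def transition(state, action, next_market):
    """One transition x_t -> x_{t+1}: the action sets the new allocation,
    while the market part evolves independently of the investor
    (assumption 2 above)."""
    _market, _allocation = state
    return make_state(next_market, action)

x_t = make_state([0.04, 2150.0], 0.0)        # all capital in cash
x_t1 = transition(x_t, 1.0, [0.04, 2180.0])  # invest fully in the index
```

Only $_t changes stochastically; the allocation part of the next state is fully determined by the action.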
Each transition results in an immediate return r_t = r(x_t, x_{t+1}, a_t) which incorporates possible transaction costs depending on the decision a_t and the change of the value of K_t due to the new values of the assets at time t+1. The aim is to maximize the expected discounted sum of the returns, V*(x) = E(Σ_{t=0}^∞ γ^t r_t | x_0 = x), by following an optimal stationary policy μ*(x_t) = a_t. For a discrete finite state space the solution can be stated as the recursive Bellman equation:

    V*(x_t) = max_a [ Σ_{x_{t+1}} p(x_{t+1}|x_t, a) r_t + γ Σ_{x_{t+1}} p(x_{t+1}|x_t, a) V*(x_{t+1}) ].   (1)

A more useful formulation defines a Q-function Q*(x, a) of state-action pairs (x_t, a_t), to allow the application of an iterative stochastic approximation scheme, called Q-Learning [11]. The Q-value Q*(x_t, a_t) quantifies the expected discounted sum of returns if one executes action a_t in state x_t and follows an optimal policy thereafter, i.e. V*(x_t) = max_a Q*(x_t, a). Observing the tuple (x_t, x_{t+1}, a_t, r_t), the tabulated Q-values are updated
in the k+1 iteration step with learning rate η_k according to:

    Q^(k+1)(x_t, a_t) = (1 − η_k) Q^(k)(x_t, a_t) + η_k [ r_t + γ max_a Q^(k)(x_{t+1}, a) ].

It can be shown that the sequence Q^(k) converges under certain assumptions to Q*. If the Q-values Q*(x, a) are approximated by separate neural networks with weight vectors w^a for different actions a, Q*(x, a) ≈ Q(x; w^a), the adaptations (called NN-QL) are based on the temporal differences d_t:

    d_t := r(x_t, a_t, x_{t+1}) + γ max_{a∈A} Q(x_{t+1}; w^a_k) − Q(x_t; w^{a_t}_k).

Note that although the market dependent part $_t of the state vector is independent of the investor's decisions, the future wealth K_{t+1} and the returns r_t are not. Therefore, asset allocation is a multi-stage decision problem and may not be reduced to a pure prediction problem if transaction costs must be considered. On the other hand, the attractive feature that the decisions do not influence the market allows one to approximate the Q-values using historical data of the financial market. We need not invest real money during the training phase.

2.2 Introduction of an artificial deterministic transition

Now, the Q-values are reformulated in order to make them independent of the actions chosen at time step t. Due to assumption 2, which states that the investor cannot influence the market by the trading decisions, the stochastic process of the dynamics of $_t is an uncontrollable Markov chain. This allows the introduction of a deterministic intermediate step between the transition from x_t to x_{t+1} (see fig. below). After the investor has chosen an action a_t, the capital K_t changes to K'_t because he/she may have paid transaction costs c_t = c(K_t, a_t), and K'_t reflects the new allocation whereas the state of the market, $_t, remains the same. Because the costs c_t are known in advance, this transition is deterministic and controllable. Then, the market switches stochastically to $_{t+1} and generates the immediate return r'_t = r'($_t, K'_t, $_{t+1}), i.e., r_t = c_t + r'_t. The capital changes to K_{t+1} = r'_t + K'_t. This transition is uncontrollable by the investor. V*($, K) = V*(x) is now computed using the costs c_t and returns r'_t (compare also eq. 1).
[Figure: the split transition: a deterministic, controllable step from ($_t, K_t) to ($_t, K'_t) with costs c_t, followed by a stochastic, uncontrollable step to ($_{t+1}, K_{t+1}) with return r'_t and Q-value Q($_t, K'_t).]

Defining Q*($_t, K'_t) as the Q-values of the intermediate time step,

    Q*($_t, K'_t) = E[ r'($_t, K'_t, $_{t+1}) + γ V*($_{t+1}, K_{t+1}) ],
gives rise to the optimal value function and policy (time indices are suppressed),

    V*($, K) = max_a [ c(K, a) + Q*($, K') ],
    μ*($, K) = argmax_a [ c(K, a) + Q*($, K') ].

Defining the temporal differences d_t for the approximation Q^(k) as

    d_t := r'($_t, K'_t, $_{t+1}) + γ max_a [ c(K_{t+1}, a) + Q^(k)($_{t+1}, K'_{t+1}) ] − Q^(k)($_t, K'_t)

leads to the update equations for the Q-values represented by tables or networks:

    QLU:    Q^(k+1)($_t, K'_t) = Q^(k)($_t, K'_t) + η_k d_t,
    NN-QLU: w^(k+1) = w^(k) + η_k d_t ∇Q($_t, K'_t; w^(k)).

The simplification is now obvious, because (NN-)QLU only needs one table or neural network no matter how many assets are concerned. This may lead to faster convergence and better results. The training algorithm boils down to the iteration of the following steps:

QLU for optimal investment decisions
1. draw random patterns $_t, $_{t+1} from the data set, and draw a random asset allocation K'_t
2. for all possible actions a: compute r'_t, c(K_{t+1}, a), Q^(k)($_{t+1}, K'_{t+1})
3. compute the temporal difference d_t
4. compute the new value Q^(k+1)($_t, K'_t) resp. Q($_t, K'_t; w^(k+1))
5. stop if the Q-values have converged, otherwise go to 1

Since QLU is equivalent to Q-Learning, QLU converges to the optimal Q-values under the same conditions as QL (e.g. [2]). The main advantage of (NN-)QLU is that this algorithm only needs one value function, no matter how many assets are concerned and how fine the grid of actions is:

    Q*(($, K), a) = c(K, a) + Q*($, K').

Interestingly, the convergence to an optimal policy of QLU does not rely on an explicit exploration strategy, because the randomly chosen capital K'_t in step 1 simulates the random action which was responsible for the transition from K_t. In combination with the randomly chosen market state $_t, sufficient exploration of the action and state space is guaranteed.

2.3 Model-free policy-iteration

The reformulation also allows the design of a policy iteration algorithm by alternating a policy evaluation phase (PE) and a policy improvement (PI) step.
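The five QLU steps listed above can be sketched as a tabular Python loop. Everything concrete here — the discretizations, the random return table, and the flat switching cost — is an illustrative assumption; only the structure of the update follows the algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
n_market, n_alloc = 5, 3                 # assumed discretizations of $ and K'
gamma, eta = 0.9, 0.1
ret = rng.normal(0.0, 0.01, size=(n_market, n_alloc))   # toy r'($_t, K'_t, .)

def cost(K, a):
    """Toy transaction cost c(K, a): pay 0.2% whenever the allocation changes."""
    return -0.002 if a != K else 0.0

Q = np.zeros((n_market, n_alloc))        # a single table for all allocations
for _ in range(20000):
    # 1. draw market states $_t, $_{t+1} and an allocation K'_t at random
    s, s_next, K = rng.integers(n_market), rng.integers(n_market), rng.integers(n_alloc)
    # 2.-3. immediate return and temporal difference d_t
    d = ret[s, K] + gamma * max(cost(K, a) + Q[s_next, a] for a in range(n_alloc)) - Q[s, K]
    # 4. update the single Q-table (step 5: loop until convergence)
    Q[s, K] += eta * d
```

The random draw of K'_t in step 1 plays the role of exploration, as discussed above.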
Defining the temporal differences d_t for the approximation Q^(k)_{μ_l} of the policy μ_l in the k-th step of PE,

    d_t := r'($_t, K'_t, $_{t+1}) + γ [ c(K_{t+1}, μ_l($_{t+1}, K_{t+1})) + Q^(k)_{μ_l}($_{t+1}, K'_{t+1}) ] − Q^(k)_{μ_l}($_t, K'_t),

leads to the following update equation for tabulated Q-values:

    Q^(k+1)_{μ_l}($_t, K'_t) = Q^(k)_{μ_l}($_t, K'_t) + η_k d_t.
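A tabulated version of this PE update can be sketched as follows; the fixed policy μ_l, returns and costs are toy assumptions, the point being that the target uses the action prescribed by μ_l rather than a maximum:

```python
import numpy as np

rng = np.random.default_rng(2)
n_market, n_alloc, gamma, eta = 4, 2, 0.9, 0.05
ret = rng.normal(0.0, 0.01, size=(n_market, n_alloc))  # toy returns r'

def cost(K, a):
    return -0.002 if a != K else 0.0                   # toy costs c(K, a)

def mu(s, K):
    return 0                                           # toy fixed policy mu_l

Q_mu = np.zeros((n_market, n_alloc))
for _ in range(10000):
    s, s_next, K = rng.integers(n_market), rng.integers(n_market), rng.integers(n_alloc)
    a = mu(s_next, K)   # evaluation: follow mu_l instead of maximizing
    d = ret[s, K] + gamma * (cost(K, a) + Q_mu[s_next, a]) - Q_mu[s, K]
    Q_mu[s, K] += eta * d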
After convergence, one can improve the policy μ_l to μ_{l+1} by

    μ_{l+1}($_t, K_t) = argmax_a [ c(K_t, a) + Q_{μ_l}($_t, K'_t) ].

By alternating the two steps PE and PI, the sequence of policies [μ_l(x)], l = 0, ..., converges under the typical assumptions to the optimal policy μ*(x) [2]. Note that policy iteration is normally not possible using classical QL if one does not have an appropriate model at hand. The introduction of the deterministic intermediate step allows one to start with an initial strategy (e.g. given by a broker), which can be subsequently optimized by model-free policy iteration trained with historical data of the financial market. Generalization to parameterized value functions is straightforward.

2.4 Experiments on the German Stock Index DAX

The NN-QLU algorithm is now tested on a real world task: assume that an investor wishes to invest her/his capital into a portfolio of stocks which behaves like the German stock index DAX. Her/his alternative is to keep the capital in the certain asset cash, referred to as DM. We compare the resulting strategy with three benchmarks, namely Neuro-Fuzzy, Buy&Hold and the naive prediction. The Buy&Hold strategy invests at the first time step in the DAX and only sells at the end. The naive prediction invests if the past return of the DAX has been positive, and vice versa. The third is based on a Neuro-Fuzzy model which was optimized to predict the daily changes of the DAX [8]. The heuristic benchmark strategy is then constructed by taking the sign of the prediction as a trading signal, such that a positive prediction leads to an investment in stocks. The input vector of the Neuro-Fuzzy model, which consists of the DAX itself and 11 other influencing market variables, was carefully optimized for optimal prediction. These inputs also constitute the $_t part of the state vector which describes the market within the NN-QLU algorithm. The data is split into a training set (from 2 Jan. 1986 to 31 Dec. 1994) and a test set (from 2 Jan. 1993 to 1 Aug. 1996).
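The two simpler benchmarks can be written as one-line trading rules; the daily return series below is made up for illustration (the actual DAX data and the Neuro-Fuzzy model are not reproduced):

```python
def naive_prediction_positions(returns):
    """Invest in the index after a positive past return, hold cash otherwise."""
    return [1 if r > 0 else 0 for r in returns[:-1]]

def buy_and_hold_positions(returns):
    """Invest at the first time step and only sell at the end."""
    return [1] * (len(returns) - 1)

daily = [0.004, -0.002, 0.003, 0.001, -0.005]   # made-up daily DAX returns
naive = naive_prediction_positions(daily)        # -> [1, 0, 1, 1]
bh = buy_and_hold_positions(daily)               # -> [1, 1, 1, 1]
```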
The transaction costs c_t are 0.2% of the invested capital if K_t is changed from DM to DAX, which is realistic for financial institutions. Referring to an epoch as one loop over all training patterns, the training proceeds as outlined in the previous section for 10000 epochs with η_k = η_0 · 0.999^k and start value η_0 = 0.05.

Table 1: Comparison of the profitability of the strategies, the number of position changes and investments in DAX for the test (training) data.

    strategy          | profit      | investments in DAX | position changes
    NN-QLU            | 1.60 (3.74) | 70 (73)%           | 30 (29)%
    Neuro-Fuzzy       | 1.35 (1.98) | 53 (53)%           | 50 (52)%
    Naive Prediction  | 0.80 (1.06) | 51 (51)%           | 51 (48)%
    Buy&Hold          | 1.21 (1.46) | 100 (100)%         | 0 (0)%

The strategy constructed with the NN-QLU algorithm, using a neural network with 8 hidden neurons and a linear output, clearly beats the benchmarks. The capital at the end of the test set (training set) exceeds the second best strategy, Neuro-Fuzzy, by about 18.5% (89%) (fig. 1). One reason for this success is that QLU changes the position less often and thus avoids expensive transaction costs. The Neuro-Fuzzy policy changes almost every second day, whereas NN-QLU changes only every third day (see tab. 1). It is interesting to analyze the learning behavior during training by evaluating the strategies of NN-QLU after each epoch. At the beginning, the policies suggest either to change almost never or to invest in the DAX every time. After some thousand epochs, these bang-bang strategies start to differentiate. Simultaneously, the more complex the strategies become, the more profit they generate (fig. 2).
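The influence of the 0.2% switching cost on profitability can be illustrated with a small accounting sketch; the positions and returns are made up, and only the DM→DAX cost rule is taken from the experiment setup:

```python
def final_capital(positions, index_returns, cost=0.002, start=1.0):
    """Compound capital over a position sequence (0 = DM/cash, 1 = DAX);
    a switch from DM to DAX pays proportional transaction costs."""
    capital, prev = start, 0              # start fully in cash
    for pos, r in zip(positions, index_returns):
        if pos == 1 and prev == 0:        # DM -> DAX switch costs 0.2%
            capital *= 1.0 - cost
        if pos == 1:                      # exposed to the index return
            capital *= 1.0 + r
        prev = pos
    return capital

# same market exposure (three invested days), different numbers of switches
frequent = final_capital([1, 0, 1, 0, 1, 0], [0.01] * 6)
rare = final_capital([1, 1, 1, 0, 0, 0], [0.01] * 6)
```

With identical exposure, the strategy with fewer position changes ends with more capital, which is the cost effect discussed above.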
Figure 1: Comparison of the development of the capital for the test set (left) and the training set (right). The NN-QLU strategy clearly beats all the benchmarks.

Figure 2: Training course: percentage of DAX investments (left), profitability measured as the average return over 60 days on the training set (right).

3 Controlling the Variance of the Investment Strategies

3.1 Risk-adjusted MDPs

People are not only interested in maximizing the return, but also in controlling the risk of their investments. This has been formalized in the Markowitz portfolio-selection, which aims for an allocation with the maximal expected return for a given risk level [4]. Given a stationary policy μ(x) with finite state space, the associated value function V^μ(x) and its variance σ²(V^μ(x)) can be defined as

    V^μ(x) = E[ Σ_{t=0}^∞ γ^t r(x_t, μ_t, x_{t+1}) | x_0 = x ],
    σ²(V^μ(x)) = E[ ( Σ_{t=0}^∞ γ^t r(x_t, μ_t, x_{t+1}) − V^μ(x) )² | x_0 = x ].

Then, an optimal strategy μ*(x; λ) for a risk-adjusted MDP (see [9], p. 410, for variance-penalized MDPs) is

    μ*(x; λ) = argmax_μ [ V^μ(x) − λ σ²(V^μ(x)) ]   for λ > 0.

By variation of λ, one can construct so-called efficient portfolios which have minimal risk for each achievable level of expected return. But in comparison to classical portfolio theory, this approach manages multi-period portfolio management systems including transaction costs. Furthermore, typical min-max requirements on the trading volume and other allocation constraints can be easily implemented by constraining the action space.
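The risk-adjusted selection rule can be illustrated with Monte-Carlo estimates of V^μ and σ²(V^μ); the two hypothetical policies below have i.i.d. normal per-step returns, a toy stand-in for real return processes:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, horizon, n_paths = 0.95, 30, 2000

def value_and_variance(mean, std):
    """Monte-Carlo estimates of V^mu(x) and sigma^2(V^mu(x)) for a policy
    whose per-step returns are i.i.d. N(mean, std) -- a toy assumption."""
    discounts = gamma ** np.arange(horizon)
    sums = (rng.normal(mean, std, size=(n_paths, horizon)) * discounts).sum(axis=1)
    return sums.mean(), sums.var()

def risk_adjusted_choice(policies, lam):
    """mu*(x; lambda) = argmax_mu [ V^mu(x) - lambda * sigma^2(V^mu(x)) ]."""
    return int(np.argmax([v - lam * var for v, var in policies]))

risky = value_and_variance(0.02, 0.20)   # higher mean, much higher variance
safe = value_and_variance(0.01, 0.02)    # lower mean, low variance
```

Raising λ moves the argmax from the risky to the safe policy, tracing out the efficient trade-off mentioned above.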
3.2 Non-linear Utility Functions

In general, it is not possible to compute σ²(V^μ(x)) with (approximate) dynamic programming or reinforcement techniques, because σ²(V^μ(x)) cannot be written as a recursive Bellman equation. One solution to this problem is the use of a return function r_t which penalizes high variance. In financial analysis, the Sharpe-ratio, which relates the mean of the single returns to their variance, i.e., r̄/σ(r), is often employed to describe the smoothness of an equity curve. For example, Moody has developed a Sharpe-ratio based error function and combines it with a recursive training procedure [5] (see also [3]). The limitation of the Sharpe-ratio is that it also penalizes upside volatility. For this reason, the use of a utility function with negative second derivative, typical for risk averse investors, seems to be more promising. For such return functions an additional unit increase is less valuable than the last unit increase [4]. An example is r = log(new portfolio value / old portfolio value), which also penalizes losses much more strongly than it rewards gains. The Q-function Q(x, a) may lead to intermediate values of a* as shown in the figure below.

[Figure: Q-values and returns as a function of the percentage of investment in the uncertain asset, illustrating intermediate optimal allocations a*.]

4 Conclusion and Future Work

Two improvements of Q-learning have been proposed to bridge the gap between classical portfolio management and asset allocation with adaptive dynamic programming. It is planned to apply these techniques within the framework of a European Community sponsored research project in order to design a decision support system for strategic asset allocation [7]. Future work includes approximations and variational methods to compute explicitly the risk σ²(V^μ(x)) of a policy.

References

[1] D. P. Bertsekas. Dynamic Programming and Optimal Control, vol. I. Athena Scientific, 1995.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming.
Athena Scientific, 1996.
[3] M. Choey and A. S. Weigend. Nonlinear trading models through Sharpe Ratio maximization. In Proc. of NNCM'96, 1997. World Scientific.
[4] E. J. Elton and M. J. Gruber. Modern Portfolio Theory and Investment Analysis. 1995.
[5] J. Moody, L. Wu, Y. Liao, and M. Saffell. Performance Functions and Reinforcement Learning for Trading Systems and Portfolios. Journal of Forecasting, 1998, forthcoming.
[6] R. Neuneier. Optimal asset allocation using adaptive dynamic programming. In Proc. of Advances in Neural Information Processing Systems, vol. 8, 1996.
[7] R. Neuneier, H. G. Zimmermann, P. Hierve, and P. Nairn. Advanced Adaptive Asset Allocation. EU Neuro-Demonstrator, 1997.
[8] R. Neuneier, H. G. Zimmermann, and S. Siekmann. Advanced Neuro-Fuzzy in Finance: Predicting the German Stock Index DAX, 1996. Invited presentation at ICONIP'96, Hong Kong; available by email from Ralph.Neuneier@mchp.siemens.de.
[9] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, 1994.
[10] S. P. Singh. Learning to Solve Markovian Decision Processes. CMPSCI TR 93-77, University of Massachusetts, November 1993.
[11] C. J. C. H. Watkins and P. Dayan. Technical Note: Q-Learning. Machine Learning: Special Issue on Reinforcement Learning, 8(3/4):279-292, May 1992.