Global Search in Combinatorial Optimization using Reinforcement Learning Algorithms

Victor V. Miagkikh and William F. Punch III
Genetic Algorithms Research and Application Group (GARAGe)
Michigan State University
2325 Engineering Building
East Lansing, MI 48824
Phone: (517) 353-3541
E-mail: {miagkikh,punch}@cse.msu.edu

Abstract- This paper presents two approaches that address the problems of the local character of the search and the imprecise state representation of reinforcement learning (RL) algorithms for solving combinatorial optimization problems. The first, Bayesian, approach aims to capture solution parameter interdependencies. The second approach combines local information as encoded by typical RL schemes and global information as contained in a population of search agents. The effectiveness of these approaches is demonstrated on the Quadratic Assignment Problem (QAP). Competitive results with the RL-agent approach suggest that it can be used as a basis for global optimization techniques.

1 Introduction

The success of a generate-and-test algorithm for a particular problem is determined by factors such as: its ability to use past experience to form feasible solutions, its exploitation/exploration strategy, its utilization of problem-specific information, etc. In creating a feasible solution, the algorithm has to make a number of decisions, e.g. which value should be assigned to a particular free parameter. The quality of the solution generated is often the only type of feedback available after a sequence of decisions is made. Since we expect the algorithm to make decisions which result in better solutions over time, the problem of intelligent solution generation can be approached with reinforcement learning (RL). The problems with delayed reinforcement that RL approaches face are well modeled by Markov Decision Processes (MDPs). MDPs are defined by: a set of Markov states, the actions available in those states, the transition probabilities, and the rewards associated with each state-action pair.
Model-based RL algorithms explicitly look for the MDP solution, an optimal policy, which is a mapping from MDP states to actions that maximizes the expected average reward received by following a path through the MDP states. An action-value function for a policy is defined as a mapping from each state-action pair to the expected average reward obtained by choosing an action in that state according to the given policy and following that policy thereafter. The state-value function for a policy specifies the desirability of a state and is defined as the expected average reward obtained by following that policy from a given state. Since the transition probabilities are not always known, typical RL algorithms, e.g. SARSA or Q-learning, are model-free. The iterative updates used by these algorithms do not use transition probabilities and are proven to converge to the optimal value function. A greedy policy that chooses an action according to the maximum optimal action-value is known to be globally optimal under the expected average reward criterion. Readers interested in a more detailed treatment of RL should consult references such as Sutton and Barto (1997).

For an example of an optimization problem formulated in RL terms, consider an RL approach to the Traveling Salesman Problem (TSP). The states are the cities. The actions are the choices of the next city to visit, and the action-values indicate the desirability of the city to visit next. The global reward is the inverse of the tour length. Immediate rewards can be defined as the inverse of the distance between a pair of cities.

The idea of using RL in optimization problem solving is almost as old as RL itself. It was first studied in the n-armed bandit setting by Bellman (1956) and later applied to more difficult optimization problems by various researchers. For example, Dorigo (1992, 1996) developed an optimization technique known as Ant Systems (AS). The key idea behind AS is a heuristic approximation to action-values, which he terms pheromone. Even though AS was derived by simulating the behavior of a population of ants, it has much in common with other RL algorithms.
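As an illustration of the TSP formulation above, the following minimal Python sketch constructs a tour greedily from a table of action-values and scores it by tour length. The function names, the data layout, and the ε-greedy selection rule are our own illustrative choices, not code from any of the systems cited.

```python
import random

def build_tour(n, Q, start=0, eps=0.9, rng=random):
    """Construct a TSP tour using action-values Q[i][j]: with probability
    eps pick the unvisited city with the highest action-value from the
    current city, otherwise pick an unvisited city at random."""
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        cur = tour[-1]
        if rng.random() < eps:
            nxt = max(unvisited, key=lambda j: Q[cur][j])
        else:
            nxt = rng.choice(sorted(unvisited))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(tour, dist):
    """Length of the closed tour; the global reward is its inverse."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
```

With Q set to the negated distances, eps = 1.0 reduces this to the nearest-neighbor heuristic; lowering eps trades exploitation for exploration, which is exactly the tension discussed later in this section.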
Another application of RL to optimization is that of Crites and Barto (1996), where they applied Q-learning to elevator scheduling. This paper is particularly relevant to our research since it explores the possibilities of multi-agent RL algorithms in optimization. Each agent in a team of RL algorithms controls a particular elevator car, cooperatively solving the entire problem. Other relevant publications include Gambardella and Dorigo (1995), who
described the application of Q-learning to the TSP and the asymmetric TSP. Zhang and Dietterich (1996) used TD(λ) to solve a job-shop scheduling problem. Singh and Bertsekas (1996) used RL for the channel allocation problem.

In order to discuss the advantages and disadvantages of RL in optimization, let us first contrast it against another well-known optimization technique, genetic algorithms (GAs). While RL maintains a value function, which reflects the desirability of free parameter assignments, GA approaches explicitly view only the overall fitness of a solution. In constructing a new solution, GAs are not guided by any synthetic fitness values associated with any smaller part of a solution. Rather, GAs are guided by schema theory, which states that the more favorable a particular choice of values for a subset of solution parameters, the more frequently such a schema appears as a part of solutions in the population. These building blocks thus represent the preferred values of solution parameters and their combinations. The ability to both explore and exploit schemata in the search space is the key to GA success, as first pointed out by Holland (1975). Thus, each schema in a GA has an implicit probability of appearing in a generated solution, where the better a schema, the higher the probability of it occurring in a solution. Such a representation is similar to tossing a coin and storing all outcomes instead of just the number of trials and the number of heads. This raises the question: is a population of solutions an accurate and computationally effective way of representing the preferences of free parameter choices as compared to some form of sufficient statistics? Since the number of schemata grows exponentially with the size of the problem, maintaining values associated with each possible combination of parameters becomes prohibitive. On the other hand, GAs do not directly learn from bad experience. Moreover, finite populations can lose some alleles, and there is only a slight chance that they may be reintroduced via mutation and survive to be used.
The use of RL techniques in optimization problems has good and bad aspects. On the positive side, they are proven to converge to the optimum given the right circumstances and are applicable to problems with a large number of states. They can also be used in conjunction with function-approximation techniques to add generalization and reduce space requirements. Boyan and Moore (1998) report good results on a number of discrete and continuous optimization problems using this approach. Direct estimation of the desirability of assignments by value functions has the potential to be both more precise and computationally cheaper than other approaches. This possibility is one of the major motivations for conducting research on the applicability of RL algorithms to optimization.

There are also disadvantages. The first is the local rather than global character of search in the RL schemes proposed so far. The algorithm has to explore the space by choosing probabilistically from among all actions, not just the action with the highest action-value estimate. In combinatorial optimization, even one incorrect exploratory step can seriously damage the quality of the resultant solution. Therefore, to generate a good solution, the most preferable action has to be selected most of the time, which strongly shifts the balance from exploration to exploitation and leads to local rather than global search. Another problem in RL is the coarse representation of the state. For instance, in solving a TSP by AS as described in Dorigo et al. (1996), or by Ant-Q in Gambardella and Dorigo (1995), the state is the current city and the action is which city to visit next. Clearly, a full representation of the state would contain both the current city and the tour of cities already visited. Since this history obviously influences further assignments, their simple definition loses the Markov property, and a suboptimal sequence of cities, a building block, is not captured. Consequently, the algorithm will not be able to handle parameter interdependence sufficiently well.
As mentioned earlier, the number of states in an RL approach cannot be made large enough to keep an estimate for every possible sequence, because the number of states grows exponentially. Nevertheless, this problem may be addressed by the use of function approximation and other means, as will be discussed further.

2 Capturing Parameter Interdependencies Using a Bayesian Approach

The coarse state representation does not take into account the interaction of available assignments with those already made. To include these interactions we can use a Bayesian approach. To construct a feasible solution for a combinatorial optimization problem, a number of free parameters must be instantiated. Let x←ψ denote the fact that some free parameter x is assigned the value ψ. For example, in the Quadratic Assignment Problem described in section 4, the free parameters are locations and the values to be assigned to those parameters are the ordinal numbers of facilities. Let P(y←χ | x←ψ) denote the conditional probability of assigning free parameter y the value χ given that x was already assigned ψ. This conditional probability indicates how an assignment already made influences the probability of the assignment under consideration. We can find P(y←χ | x←ψ) using Bayes' rule:

    P(y←χ | x←ψ) = P(x←ψ | y←χ) P(y←χ) / P(x←ψ)    (2.1)

where P(y←χ) is the prior probability of assigning y the value χ and P(x←ψ | y←χ) is the likelihood of assignment
y←χ with respect to assignment x←ψ. If we set P(x←ψ) to one, (2.1) can be simplified to P(y←χ | x←ψ) = P(x←ψ | y←χ) P(y←χ). Now suppose that a sequence of k assignments S = (x1←ψ1, x2←ψ2, ..., xk←ψk) was made. The posterior probability of assignment y←χ given that the assignments in S are made is P(y←χ | S) = P(S | y←χ) P(y←χ). By computing this posterior probability for all candidate values χ of the free parameter y under consideration, one can find how a particular χ fits with the assignments already made. Note that the number of conditional probabilities P(x1←ψ1, x2←ψ2, ..., xk←ψk | y←χ) grows exponentially with the size of the problem. We can use the naive Bayesian approach to address this problem. Assuming independence of the assignments xi←ψi, we get:

    P(y←χ | S) = P(y←χ) ∏ (i=1..k) P(xi←ψi | y←χ)    (2.2)

The probability of assignment y←χ given that the k assignments in S were already made is the product of its prior probability P(y←χ) and the conditional probabilities P(xi←ψi | y←χ) for all prior assignments. If n is the total number of decisions needed to construct a feasible solution, then there are O(n^4) conditional probabilities P(xi←ψi | y←χ). Thus, even the naive Bayesian scheme leads to a high, O(n^4), space complexity, which is acceptable only for moderately sized problems. Since the prior and conditional probabilities are not known, their estimates P̂(y←χ) and P̂(xi←ψi | y←χ) must be determined in the course of the search. These can be found as the frequencies of assignment co-occurrences. However, maintaining both probability estimates and value-function estimates involves a significant increase in space and computational requirements. To avoid this, we may keep only value-function estimates, exploiting the dependence of action probabilities on the action-values and the policy being used. Under any reasonable policy, the actions with higher action-values have a greater chance of being selected. From the RL point of view, a state corresponds to a free parameter and an action corresponds to a choice of value for that parameter.
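The naive-Bayes posterior (2.2) can be sketched in a few lines of Python. This is a minimal illustration under our own data layout (dictionaries keyed by assignment pairs); the function names are hypothetical and the estimates would in practice be learned frequencies, as the text describes.

```python
def posterior(prior, cond, y, chi, assignments):
    """Naive-Bayes score (2.2): the prior P(y<-chi) times the product of
    the conditionals P(x<-psi | y<-chi) over assignments already made."""
    p = prior[(y, chi)]
    for (x, psi) in assignments:
        p *= cond[((x, psi), (y, chi))]
    return p

def best_value(prior, cond, y, candidates, assignments):
    """Pick the candidate value chi that best fits the assignments made."""
    return max(candidates,
               key=lambda chi: posterior(prior, cond, y, chi, assignments))
```

Scoring every candidate value χ this way costs O(k) multiplications per candidate, but storing the conditionals is what drives the O(n^4) space complexity noted above.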
For a proportional policy, the probability estimate P̂(s, a) of action a in state s is proportional to its action-value Q̂(s, a):

    P̂(s, a) = Q̂(s, a) / Σ (a'∈A) Q̂(s, a'),  so that  P̂(s, a) ∝ Q̂(s, a)    (2.3)

where A is the set of all actions available in s. Assuming (2.3) for all policies involved, we can apply the Bayesian scheme not to the probability estimates but to normalized action-values. For example, using a Monte Carlo update rule, we can find the action-values of assignments as:

    P̃(y←χ) = P̃(y←χ) + α [ r − P̃(y←χ) ]    (2.4)

    P̃(x←ψ | y←χ) = P̃(x←ψ | y←χ) + β [ r − P̃(x←ψ | y←χ) ]    (2.5)

where we write P̃(y←χ) for Q̂(y, χ) ∝ P̂(y←χ) to conform with our notation for assignments and to show the relationship with probabilities. P̃(x←ψ | y←χ) is the expected average reward for taking action ψ in state x given that action χ was taken in state y. The reward r can be based on a comparison with the average fitness, avfit, of the last M solutions generated:

    r = (avfit − Fitness) / avfit + 0.5    (2.6)

The high space complexity of this approach is a problem. This approach is therefore only feasible for problem instances of moderate size unless used along with function approximation. Also, since only n^2 of the total O(n^4) entries P̃(x←ψ | y←χ) get updated per iteration, this method will converge slowly. However, since this approach attempts to solve the problem of parameter interdependence directly, it has significant theoretical importance for the sake of comparison with indirect approaches, one of which is introduced next.

3 The Approach Based on a Population of RL Search Agents

Since directly capturing parameter interdependencies by keeping additional estimates is expensive, we can consider indirect approaches. One such approach is as follows: we continue to use a coarse representation of the state, but stop looking for general preference values which would be valid in any part of the search space. Since the coarse representation collapses many true states of the system into one, making them indistinguishable, the action-values associated with coarse state-action pairs can only be valid for a local part of the search space.
We will call this the principle of locality of action-values. However, action-values from different parts of the search space can be more broadly applicable. Therefore, this approach maintains a population not only of solutions, which are the best results of the search conducted by an RL algorithm situated in some area of the search space, but also of their action-values. This coupling of a locally-best solution, the action-values, and an RL algorithm is defined as an agent: an expert in its local area of the search space. Once we have local information from different parts of the search space, we
need a way to combine the results of the best search in one area with those from another. Since each agent in the population is addressing the same optimization problem, we expect that at least some other agents' action-values are useful in areas other than the local space in which they were formed. This assumption of homogeneity allows us to combine results from multiple agents. Consider one such approach: a new solution is formed by copying a part of the locally-best solution found by one agent, while the remaining assignments are made using action-values borrowed from another agent. How does this compare to recombining two solutions using GA crossover? In GA crossover we have two kinds of information: the two instances, and perhaps some problem-specific information. For example, the Grefenstette (1985) crossover for TSP has to make 40% of its assignments at random to avoid conflicts with previous assignments. With action-values, we can direct those assignments rather than make them randomly. This increases the chances of finding a good sequence. Thus, the operation described looks like a kind of crossover, using two instances to generate one child, based on indirect transfer of information through action-values. We may also think of it as combining both partial results and preferences resulting from search conducted by other agents. A possible variation on this theme is to generate a partial solution with one agent and use another agent to generate the remainder. Approaches using both a central solution and action-values are also possible. This synthetic approach would allow combining the advantages of both RL and GAs.

In addition to capturing interdependencies, a population of RL search agents provides opportunities for more global search. As was noted in the introduction, the local character of the search comes in part from constructing the entire solution from scratch. Our approach uses an RL algorithm to generate not the whole solution, but only a part of it. The other part is replicated from the best solution found so far by this or another RL algorithm.
At first glance this might seem to make the approach even more locally oriented. This is the case only if the replicated part was discovered by some other agent which followed a similar thread of search. To enforce independent threads of search as conducted by each agent in the population, we can choose the following replacement policy: the child competes with the parent which was the source of the replicated material, and the better solution (parent or child) is placed into the next generation. In this case, two agents are similar (same preferences, etc.) only if they discovered the same solution independently based on their own action-values. Another way to make the search more global is to allow the RL approach to wander more (follow its preferences less stringently). To avoid introducing poor solutions into the population, each solution can be passed through a problem-specific local optimizer to see if this exploration found a useful area of the search space. These two approaches are complementary, because independent threads reduce crowding, which can cause premature convergence to a local optimum. In turn, local optimization allows broader search by allowing the parameters controlling exploration in the RL algorithm to be set less tightly. Since an instance in the population is not only a solution, but also a matrix of action-values, it is costly to copy. This is one of the reasons that competitive replacement is used in the algorithm. We assume here that if the child is better than the parent which served as the source of the replicated part, then the child inherits all the preference values of that parent. Depending on the results of the competition, the update of the preference values is made either in both parents or in the child and the other parent, and there is no need to copy them.

Initialize population and parameters;
Repeat
  Select two agents A1 and A2 from the population using e.g.
  proportional selection based on the fitness of the central solution;
  For each free parameter, with probability λ do
    Copy the value of the free parameter from A1 to offspring O;
  End
  For each unassigned free parameter in O do
    In problem-specific order:
      Select a value to be assigned to this free parameter from the set
      of possible values, according to some policy based on the
      action-values of A2, and assign it to that free parameter;
  End
  Pass O through a local optimizer (optional step);
  Evaluate O; f(O) denotes the fitness of O;
  Compute reward r;
  If f(O) is better than the fitness of the central solution of A1 then
    Copy O to the central solution of A1;
  End
  Update action-values of A1 and A2 using reward r;
Until termination condition;
Output best solution in population

Figure 1: High-level pseudocode for the RL-agent approach.

The high-level pseudocode of the approach is shown in Figure 1. There is a population of RL agents, each comprised of a locally-best solution, a matrix of action-values, and the parameters for the RL algorithm. To produce a new agent, two solutions are selected from the population using proportional or another type of selection. The new solution is formed using the solution of one parent and the action-values of the other. After calculating the fitness of the new solution, the child competes with the
parents for inclusion in the population. Then the value functions are updated, which completes the generation cycle. The reward can be based on the difference between the fitness of the new solution and the average fitness of the parents, or some other baseline. Depending on the problem being solved and the particular RL algorithm used, local rewards could also be employed.

4 Application to the QAP

The Quadratic Assignment Problem is an NP-hard problem of finding a permutation φ minimizing:

    Z(φ) = Σ (i=1..n) C(i,φ(i)) + Σ (i=1..n) Σ (j=1..n) A(i,j) B(φ(i),φ(j))    (4.1)

where n is the number of facilities/locations, C(i,j) is the cost of locating facility j at location i, A(i,j) is the cost of transferring a material unit from location i to location j, and B(i,j) is the flow of material from facility i to facility j. The permutation φ indicates the assignment of a facility to a location. The double summation of product terms makes the QAP highly non-linear. The preference values can estimate the goodness of assigning a specific location to some facility. The result of assigning a facility to a location is highly dependent on how the other facilities are assigned. This property makes the QAP a very interesting subject on which to test the presented approaches.

4.1 Population-Based Approach

In accordance with the approach, a new feasible solution is formed in part by replicating fragments of the best solution discovered by one of the agents and filling in the remaining part using the value function of another agent. To construct a new feasible solution, unoccupied locations were selected in random order and assigned facilities using an ε-greedy proportional policy. With probability ε, this policy chooses the facility from the set of not-yet-assigned facilities having the maximum action-value, or, with probability 1−ε, chooses one of the remaining options with probability proportional to the estimate of desirability for that assignment. The balance between copying fragments of the best solution and generating the rest using preference values is controlled by the parameter λ, which is the fraction of copied values among the total number of assignments.
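The QAP objective (4.1) translates directly into code. The following sketch assumes, as in our reading of (4.1), that phi[i] is the facility placed at location i; the function name and matrix layout are illustrative.

```python
def qap_cost(phi, C, A, B):
    """Objective (4.1): phi[i] is the facility placed at location i,
    C[i][f] is the cost of placing facility f at location i, A[i][j] is
    the transfer cost between locations i and j, and B[f][g] is the flow
    between facilities f and g."""
    n = len(phi)
    linear = sum(C[i][phi[i]] for i in range(n))
    quadratic = sum(A[i][j] * B[phi[i]][phi[j]]
                    for i in range(n) for j in range(n))
    return linear + quadratic
```

Evaluating one permutation costs O(n^2) because of the double summation, which is why the number of function evaluations (NFE) is the natural cost measure in the results tables.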
Thus, λ = 0 corresponds to using the RL algorithm to make all assignments. Each cell had probability λ of being copied, and the remaining, on average n(1−λ), positions were filled using action-values Q(li, fj) reflecting the desirability of assigning facility fj to location li. In the QAP there is no obvious order in which assignments should be made. This makes the application of bootstrapping RL algorithms such as Q-learning difficult, unless some order is imposed, which would put a strong bias on solution generation. There are a number of ways to resolve this difficulty, but in the context of the present approach a simple Monte Carlo update (4.2), which does not require a particular order, was used, at the price of slower convergence. The authors used a bootstrapping Q-learning update rule in the application of this approach to the Asymmetric Traveling Salesman Problem (ATSP) described in Miagkikh and Punch (1999). In the present application, the action-values Q(li, fj) were learned using the simple Monte Carlo update:

    Q(l, f) = Q(l, f) + α [ r − Q(l, f) ]    (4.2)

where the reward r was calculated on the basis of the average fitness of the two parents according to (2.6). Generated solutions were improved by a simple 1-Opt optimizer.

4.2 Bayesian Approach

To implement the approach capturing interdependencies directly, one RL algorithm, a replica of an agent in the population from the population-based approach, was used. This RL algorithm was augmented with an n^4 matrix of conditional average rewards. The O(n^2) procedure (2.5) was used to update the entries of this matrix. The action-values were computed according to (2.4). The reward r for these two updates was calculated by (2.6), where avfit was the average fitness of the last M solutions produced. The process of feasible solution generation was identical to the RL-agent approach with a few changes: posterior action-values computed by (2.5) were used instead of the action-values in (4.2). The procedure for accepting a new solution was also relaxed in comparison to the population-based approach.
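The ε-greedy proportional policy of section 4.1 and the Monte Carlo update (4.2) can be sketched as follows. This is a minimal illustration with hypothetical function names; the action-values are stored as a dictionary row per location, and ties and zero-valued rows are handled in ways of our own choosing.

```python
import random

def eps_greedy_proportional(Q_row, candidates, eps, rng=random):
    """With probability eps choose the candidate with the highest
    action-value; otherwise sample a candidate with probability
    proportional to its (non-negative) action-value."""
    cands = list(candidates)
    if rng.random() < eps:
        return max(cands, key=lambda c: Q_row[c])
    total = sum(Q_row[c] for c in cands)
    if total <= 0:
        return rng.choice(cands)  # no information: choose uniformly
    x = rng.random() * total      # roulette-wheel over action-values
    for c in cands:
        if x < Q_row[c]:
            return c
        x -= Q_row[c]
    return cands[-1]

def mc_update(Q, key, r, alpha):
    """Monte Carlo update (4.2): move Q(l, f) toward the reward r."""
    Q[key] += alpha * (r - Q[key])
    return Q[key]
```

The same roulette-wheel mechanism also implements the proportional selection of agents in Figure 1, with fitnesses in place of action-values.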
The new solution is accepted as the new center if the reward is greater than some constant threshold T. Since the average fitness of solutions decreases during the course of minimization, this acceptance rule resembles an annealing schedule. The same ε-greedy rule was used as in the population-based approach; however, a different parameter ε was generated for each iteration to improve the ability to escape local minima. As in the first approach, a simple 1-Opt optimizer was used to improve newly created solutions.

5 Results

The experimental runs of the RL-agent approach were based on a population of 50 agents, which is relatively small for a GA but was enough to obtain good results using this scheme. Roulette-wheel selection was used. The parameter λ was randomly generated in the range [0.7, 0.95] for each application of the RL crossover. A generation-based approach with a crossover rate of 0.1 was used. The learning and selection greediness parameters were in the ranges [0.05, 0.15] for α and [0.4, 0.95] for ε, respectively. These ranges of parameters were found in a
series of preliminary (not described) experiments. Each of the agents in the population was assigned a combination of parameters in these ranges during initialization. Thus there was a broad range of agent types based on a random selection of the various control parameters. For the approach based on Bayes' rule, the parameters were β = 0.1, α = 0.1, T = 0.55, M = 300. The parameters ε and λ were randomly generated in the intervals [0.3, 0.99] and [0.5, 0.9], respectively, for each iteration.

The results of the RL-agent and Bayesian approaches, based on averaging 10 runs over some of the benchmark problems from QAPLIB by Burkard et al. (1997), are given in Tables 1 and 2, respectively (in the Appendix). Since the RL-agent approach has many features in common with AS and GAs, those approaches are used as a comparison. The columns AS and GA+LS in Table 1 show the results obtained with the AS due to Maniezzo and Colorni (1998) and a GA with local search by Merz and Freisleben (1997), respectively. The RL algorithm with Bayesian correction was compared with two other one-point methods to contrast the quality of search and the ability to scale. The columns AS and GRASP in Table 2 show the results obtained with the AS due to Maniezzo and Colorni (1998) and the greedy randomized search procedure (GRASP) by Li et al. (1994). The RL algorithm with Bayesian correction was tested on the same set of benchmarks as AS in Maniezzo and Colorni (1998). Unfortunately, only the values of the best-found solution are available for AS and GRASP; thus we cannot compare them on average. However, the RL algorithm with Bayesian correction obtained better results in terms of the best-found solution in 7 out of 34 benchmarks. There was a tie in the remaining 27 test problems. However, the presented RL algorithm does not scale well to the larger problems. In the case of the population-based approach, the results are much better. The RL agents achieved the same or better performance on all test problems in comparison to the AS-based algorithms.
In comparison to GA+LS, the approach presented showed results which were better on some benchmarks and slightly worse on others (of 15 problems, 8 better and 7 worse). It can be concluded that the RL-agents approach and GA+LS are quite competitive. One of the remarkable features is the consistency of the search in the RL-agents approach: the presented algorithm found the optimum or best-known solution in each of the 10 runs on all small and moderately-sized instances. This is not the case with GA+LS, which had a non-zero standard deviation of the best-found solutions even on relatively simple QAP benchmarks such as Nug30 or Kra30a. Unfortunately, only a small number of benchmark results for GA+LS are available, which precludes a more detailed comparison. Comparing the two approaches presented in this paper, the population of RL agents certainly wins. The RL algorithm with Bayesian correction cannot provide a quality of search equivalent to the first approach. Even though our experiments show some advantage of the Bayesian approach over other known non-population-based techniques, it does not compare well with either the RL agents or GAs. RL with Bayesian correction has an advantage in the number of function evaluations in comparison with population-based approaches, though it has high space requirements.

6 Conclusions and Future Work

The results of the approach using a population of RL agents are competitive with the other search techniques. The authors have also applied this approach to the ATSP and obtained good results (Miagkikh and Punch 1999). The approaches presented address the two major problems of RL algorithms applied to optimization, namely the local character of search and the coarse state representation. It has been shown that these problems can be overcome to obtain a global search technique capable of producing good results. There are still many issues to be addressed.
One of them is to show whether preference values are a computationally cheaper and more precise way of maintaining desirability in comparison to GAs and other search techniques, and if so, under what conditions. There are many other problems, such as the absence of a natural ordering in the QAP and many other search problems, which can result in complications when bootstrapping RL update rules are used. The high space complexity of the Bayesian approach gives it more theoretical than practical value if a tabular representation of the value function is used. The authors are working on an implementation of this approach using function approximation. However, in spite of all these and other difficulties, the results obtained are very encouraging.

References

Sutton, R. and Barto, A. (1997). Reinforcement Learning: An Introduction. MIT Press.

Bellman, R. (1956). A Problem in the Sequential Design of Experiments. Sankhya, 16:221-229.

Dorigo, M. (1992). Optimization, Learning and Natural Algorithms. Ph.D. Thesis, Politecnico di Milano, Italy (in Italian).

Dorigo, M., Maniezzo, V., and Colorni, A. (1996). The Ant System: Optimization by a Colony of Cooperating Agents. IEEE Trans. on SMC-Part B, 26(1):29-41, IEEE Press.
Gambardella, L. and Dorigo, M. (1995). Ant-Q: A Reinforcement Learning Approach to the Traveling Salesman Problem. In Proc. 12th Int. Conf. on Machine Learning, 252-260, Morgan Kaufmann.

Crites, R. and Barto, A. (1996). Improving Elevator Performance Using Reinforcement Learning. In Advances in Neural Information Processing Systems, 1017-1023, MIT Press.

Singh, S. and Bertsekas, D. (1996). Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems. In Advances in Neural Information Processing Systems, 974-980, MIT Press.

Zhang, W. and Dietterich, T. (1996). High-Performance Job-Shop Scheduling with a Time-Delay TD(λ) Network. In Advances in Neural Information Processing Systems, 1024-1030, MIT Press.

Holland, J. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.

Grefenstette, J. et al. (1985). Genetic Algorithms for the Traveling Salesman Problem. In Proc. of the 1st Int. Conf. on Genetic Algorithms and Their Applications, 160-165, Lawrence Erlbaum Associates.

Boyan, J. and Moore, A. (1998). Learning Evaluation Functions for Global Optimization and Boolean Satisfiability. In Proc. 15th National Conf. on AI, AAAI.

Miagkikh, V. and Punch, W. (1999). An Approach to Solving Combinatorial Optimization Problems Using a Population of Reinforcement Learning Agents. To appear in Proc. of the Genetic and Evolutionary Computation Conference (GECCO-99), Morgan Kaufmann.

Burkard, R., Karisch, S., and Rendl, F. (1997). QAPLIB - A Quadratic Assignment Problem Library. Journal of Global Optimization, 10:391-403.

Maniezzo, V. and Colorni, A. (1998). The Ant System Applied to the Quadratic Assignment Problem. To appear in IEEE Transactions on Knowledge and Data Engineering.

Merz, P. and Freisleben, B. (1997). A Genetic Local Search Approach to the Quadratic Assignment Problem. In Proc. of the 7th Int. Conf. on Genetic Algorithms (ICGA'97), 465-472.

Appendix

Table 1: Results of the RL-agent approach on QAPLIB benchmarks by Burkard et al. (1997). The meaning of the columns is as follows: Benchmark: the name of the benchmark; Opt./BKS:
the optimal or best-known solution for this problem; Best: the best result found by the population of RL agents in 10 runs; Average: the average of the best solutions found in 10 runs; Std. Dev.: the standard deviation of the distribution of the values of the best solutions found; NFE: the average number of function evaluations needed to find the best solution; AS: the fitness of the best solution obtained by the AS of Maniezzo and Colorni (1998); GA+LS: the average fitness of solutions obtained by the GA with local search described in Merz and Freisleben (1997). The best solution among the three techniques is bolded.

Benchmark Opt./BKS Best Average Std. Dev. NFE AS GA+LS
Bur26a 5426670 5426670 5426670 0.0 26187 5426670 N/A
Bur26b 3817852 3817852 3817852 0.0 44086 3817852 N/A
Bur26c 5426795 5426795 5426795 0.0 41360 5426795 N/A
Bur26d 3821225 3821225 3821225 0.0 30020 3821225 N/A
Bur26e 5386879 5386879 5386879 0.0 55209 5386879 N/A
Bur26f 3782044 3782044 3782044 0.0 16630 3782044 N/A
Bur26g 10117172 10117172 10117172 0.0 59161 10117172 N/A
Chr20a 2192 2192 2192 0.0 304419 2192 N/A
Chr20b 2298 2298 2298 0.0 628084 2362 N/A
Chr20c 14142 14142 14142 0.0 35636 14142 N/A
Chr22a 6156 6156 6156 0.0 299388 6156 N/A
Chr22b 6194 6194 6194 0.0 416755 6254 N/A
Esc32a 130 130 130 0.0 264 130 N/A
Kra30a 88900 88900 88900 0.0 70563 88900 N/A
Kra30b 91420 91420 91420 0.0 524071 91420 N/A
Lpa20a 3683 3683 3683 0.0 6716 3683 N/A
Lpa30a 13178 13178 13178 0.0 39671 13178 N/A
Lpa40a 31538 31538 31538 0.0 178556 31859 N/A
Nug20 2570 2570 2570 0.0 17524 2570 N/A
Nug30 6124 6124 6124 0.0 488602 6124 6125.6
Scr20 110030 110030 110030 0.0 61332 110030 N/A
Ste36a 9526 9526 9526 0.0 637048 9598 9535.6
Ste36b 15852 15852 15852 0.0 132011 15892 N/A
Ste36c 8239110 8239110 8239110 0.0 1239520 8265934 N/A
Sko100a 152002 152250 152374.6 104.51 4925566 N/A 152253.0
Tai60a 7208572 7299714 7305455 6818.07 5937914 N/A 7309143.4
Tai60b 608215054 608215054 608283498 76833.33 5783030 N/A 608215040.0
Tai100a 21125314 21452028 21505981.8 27060.3 7516822 N/A 21372797.6
Tai100b 1185996137 1186007112 1187068525 998793.5 4328802 N/A 1188197862.44
Tai150b 498896643 501198597 502255500.1 704186.2 2547670 N/A 502200800.0
Tai256c 44759294 44830390 44838185.14 8639.366 68935793 N/A 44839138.3
Tho30 149936 149936 149936 0.0 2951608 149936 N/A
Tho40 240516 240516 240516 0.0 3754472 242108 N/A
Tho150 8134030 8166808 8170678 8149.168 9476401 N/A 8160088.0

Table 2: Results of the RL algorithm with Bayesian correction on QAPLIB benchmarks by Burkard et al. (1997). The meaning of the columns is as follows: Benchmark: the name of the benchmark; Opt./BKS: the optimal or best-known solution for this problem; Best: the best result found in 10 runs; Average: the average of the best solutions found in 10 runs; Std. Dev.: the standard deviation of the distribution of the values of the best solutions found; NFE: the average number of function evaluations needed to find the best solution; AS: the fitness of the best solution obtained by the AS of Maniezzo and Colorni (1998); GRASP: the best fitness of solutions obtained by the greedy randomized search procedure (GRASP) by Li, Pardalos, and Resende (1994), as given in Maniezzo and Colorni (1998). The best solution among the three techniques is bolded.

Benchmark Opt./BKS Best Average Std. Dev.
NFE AS GRASP
Bur26a 5426670 5426670 5426670 0.0 195794 5426670 5426670
Bur26b 3817852 3817852 3817852 0.0 31323 3817852 3817852
Bur26c 5426795 5426795 5426795 0.0 18675 5426795 5426795
Bur26d 3821225 3821225 3821225 0.0 18966 3821225 3821225
Bur26e 5386879 5386879 5386879 0.0 15483 5386879 5386879
Bur26f 3782044 3782044 3782044 0.0 6855 3782044 3782044
Bur26g 10117172 10117172 10117172 0.0 7131 10117172 10117172
Chr20a 2192 2192 2192 0.0 368109 2192 2232
Chr20b 2298 2394 2394 0.0 667998 2362 2434
Chr20c 14142 14142 14142 0.0 44248 14142 14142
Chr22a 6156 6156 6156 0.0 851158 6156 6298
Chr22b 6194 6194 6218.3 19.87 722198 6254 6354
Els19 17212548 17212548 17212548 0.0 5109 N/A N/A
Esc32a 130 130 130.54 0.93 285046 130 132
Esc32b 168 168 168 0.0 288285 168 168
Esc32c 642 642 642 0.0 866492 642 642
Esc32d 200 200 200 0.0 765784 200 200
Esc32e 2 2 2 0.0 33 2 2
Esc32f 2 2 2 0.0 33 2 2
Esc32g 6 6 6 0.0 36 6 6
Esc64a 116 116 116 0.0 377798 N/A N/A
Kra30a 88900 88900 88917.27 57.28 290167 88900 88900
Kra30b 91420 91420 91454.5 55.20 295429 91420 91710
Lpa20a 3683 3683 3683 0.0 14229 3683 3683
Lpa30a 13178 13178 13179.6 5.42 123704 13178 13178
Lpa40a 31538 31538 31593.18 123.22 124748 31859 31859
Nug20 2570 2570 2570 0.0 48307 2570 2570
Nug30 6124 6124 6125.8 2.08 485881 6124 6150
Scr20 110030 110030 110030 0.0 21265 110030 110030
Ste36a 9526 9526 9533.091 9.73 220754 9598 9698
Ste36b 15852 15852 15887.8 54.62 215278 15892 15998
Ste36c 8239110 8239110 8239187 253.87 135984 8265934 8312752
Tho30 149936 149936 149967.3 103.71 320404 149936 149936
Tho40 240516 240632 241338.2 838.35 177399 242108 243320
Sko56 34458 34502 34617.0 96.69 845275 N/A N/A