Sequential Decision-Making with Big Data: Papers from the AAAI-14 Workshop

Surprise and Curiosity for Big Data Robotics

Adam White, Joseph Modayil, Richard S. Sutton
Reinforcement Learning and Artificial Intelligence Laboratory
University of Alberta, Edmonton, Alberta, Canada

Abstract

This paper introduces a new perspective on curiosity and intrinsic motivation, viewed as the problem of generating behavior data for parallel off-policy learning. We provide 1) the first measure of surprise based on off-policy general value function learning progress, 2) the first investigation of reactive behavior control with parallel gradient temporal difference learning and function approximation, and 3) the first demonstration of using curiosity-driven control to react to a non-stationary learning task, all on a mobile robot. Our approach improves scalability over previous off-policy robot learning systems, essential for making progress on the ultimate big-data decision-making problem: life-long robot learning.

Off-policy, life-long robot learning is an immense big-data decision-making problem. In life-long learning the agent's task is to learn from an effectively infinite stream of interaction. For example, a robot updating at 100 times a second, running 8 hours a day, with a few dozen sensors can produce gigabytes of raw observation data every year of its life. Beyond the temporal scale of the problem, off-policy life-long learning enables additional scaling in the number of things that can be learned in parallel, as demonstrated by recent predictive learning systems (see Modayil et al. 2012, White et al. 2013). A special challenge in off-policy, life-long learning is to select actions in a way that provides effective training data for potentially thousands or millions of prediction learners with diverse needs, which is the subject of this study.

Surprise and curiosity play an important role in any learning system. These ideas have been explored in the context of option learning (Singh et al. 2005, Simsek and Barto 2006, Schembri et al. 2007), developmental robot exploration (Schmidhuber 1991, Oudeyer et al. 2007), and exploration and exploitation in reinforcement learning (see Baldassarre and Mirolli 2013 for an overview). Informally, surprise is an unexpected prediction error. For example, a robot might be surprised about its current draw as it drives across sand for the first time. An agent might be surprised if its reward function suddenly changed sign, producing large unexpected negative rewards. An agent should, however, be unsurprised if its prediction of future sensory events falls within the error induced by sensor noise. Equipped with a measure of surprise, an agent can react to unexpected situations, changing how it is behaving to encourage relearning. We call this reactive adaptation curious behavior.

In this paper we study how surprise and curiosity can be used to adjust a robot's behavior in the face of a changing world. In particular, we focus on the situation where a robot has already learned two off-policy predictions about two distinct policies. The robot then experiences a physical change that significantly impacts the predictive accuracy of a single prediction. The robot observes its own inability to accurately predict future battery current draw when it executes a rotation command, exciting its internal surprise measure. The robot's behavior responds by selecting actions to speed relearning of the incorrect prediction: spinning in place until the robot is no longer surprised, then returning to normal operation.
This paper provides the first empirical demonstration of surprise and curiosity based on off-policy learning progress on a mobile robot. Our specific instantiation of surprise is based on the instantaneous temporal difference error, rather than novelty, salience, or predicted error (all explored in previous work). Our measure is unique because 1) it balances knowledge-based and competence-based learning, and 2) it uses error generated by off-policy reinforcement learning algorithms on real robot data. Our experiment uses a commodity off-the-shelf iRobot Create and a simple camera, resulting in real-time adaptive control with visual features. We focus on the particular case of responding to a dramatic increase in surprise due to a change in the world, rather than initial learning. The approach described in this paper scales naturally to massive temporal streams, high-dimensional features, and the many independent off-policy learners common in life-long robot learning.

Background

We model an agent's interaction with the world (including the robot's body) as a discrete-time dynamical system. On each time step t, the agent observes a feature vector x_t ∈ X ⊂ R^n that only partially characterizes the environmental state s_t ∈ S. We assume s_t is the current state of some unobservable Markov Decision Process (MDP), and thus x_t is computed from any information available to the agent at time t.
On each step the agent takes an action a_t ∈ A, resulting in a transition in the underlying MDP, and the agent observes a new feature vector x_{t+1}. In conventional reinforcement learning, the agent's objective is to predict the total discounted future reward on every time step. The reward is a special scalar signal r_{t+1} = r(x_{t+1}) ∈ R that is emitted by the environment on each step. To predict reward, the agent learns a value function v : S → R. The time scale of the prediction is controlled by a discount factor γ ∈ [0, 1). The precise quantity to be predicted is the return g_t = ∑_{k=0}^∞ γ^k r_{t+k+1}, and the value function is the expected value of the return,

    v(s) = E_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ],

where the expectation is conditional on the actions (after t) being selected according to a particular policy π : X × A → [0, 1], as denoted by the subscript on the expectation operator. As is common in reinforcement learning, we estimate v with a linear approximation, v_w(s_t) = w^T x_t ≈ v(s_t), where w ∈ R^n.

We use generalized notions of reward and termination to enable learning a broader class of predictions than is typically considered in conventional reinforcement learning. First, notice that we can define the reward to be any bounded function of x_t, such as the instantaneous value of an IR sensor, which we call a pseudo reward. Second, γ need not be constant, but can also be defined as a function of the features, γ : X → [0, 1]. These changes require the definition of a general value function (GVF),

    v(s) = E_π[ ∑_{k=0}^∞ ( ∏_{j=1}^{k} γ(x_{t+j}) ) r(x_{t+k+1}) | S_t = s ],

but no special algorithmic modifications are required to learn GVFs. See Modayil et al. (2014) for a more detailed explanation of GVF prediction.

In order to learn many value functions in parallel, each conditioned on a different policy, we require off-policy learning. Off-policy reinforcement learning allows the agent to learn predictions about one policy while following another. In this setting, the policy that conditions the value function, the target policy π, is different from the policy used to select actions and control the robot, called the behavior policy μ : X × A → [0, 1]. Separating the data-generation policy from the policy to be learned allows learning about many value functions in parallel from a single stream of experience. Each prediction V_t^{(i)} = v_w^{(i)}(s_t) is about the future pseudo reward r^{(i)}(x_t) observed if the agent follows π^{(i)}, with pseudo termination according to γ^{(i)}(x_t). These policy-contingent predictions can each be learned by independent instances of off-policy reinforcement learning algorithms.

One such off-policy algorithm is GTD(λ), which performs stochastic gradient descent on the Mean Squared Projected Bellman Error (MSPBE) with linear time and space complexity. The update equations for GTD(λ),

    e_t = (π(x_t, a_t) / μ(x_t, a_t)) (x_t + γ(x_t) λ e_{t-1})
    w_{t+1} = w_t + α( δ_t e_t - γ(x_{t+1})(1 - λ)(e_t^T h_t) x_{t+1} )
    h_{t+1} = h_t + β( δ_t e_t - (x_t^T h_t) x_t ),

require an eligibility trace vector e_t ∈ R^n, scalar learning-rate parameters α and β, and the usual temporal difference error δ_t = r(x_{t+1}) + γ(x_{t+1}) w_t^T x_{t+1} - w_t^T x_t. The MSPBE is a convex objective that can be directly estimated from data,

    MSPBE(w) = E_μ[δe]^T E_μ[x x^T]^{-1} E_μ[δe],

which is useful for tracking the learning progress of off-policy predictions (White et al. 2013).

Experiment

The objective of this paper is to investigate one potential benefit of using surprise to influence decision making. Specifically, we seek a concrete demonstration of adapting the behavior policy automatically, in response to a perturbation of the agent's sensorimotor stream. The change is unexpected and not modelled by the agent. The change is designed to influence the accuracy of only one GVF prediction.
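To make the learning machinery above concrete, the following is a minimal sketch, in Python with NumPy, of a single GTD(λ) learner for one GVF. It is not the authors' implementation; the class and method names are illustrative, and it simply transcribes the GTD(λ) update equations given in the Background.

import numpy as np


class GTDLambda:
    """Minimal sketch of one GTD(lambda) learner for a single GVF, transcribing
    the update equations in the Background section. Names are illustrative."""

    def __init__(self, n_features, alpha=0.05, beta=0.0005, lam=0.9):
        self.w = np.zeros(n_features)  # primary weights: the value estimate v_w(s) = w.x
        self.h = np.zeros(n_features)  # secondary weights used by the gradient correction
        self.e = np.zeros(n_features)  # eligibility trace
        self.alpha = alpha             # primary step size
        self.beta = beta               # secondary step size
        self.lam = lam                 # trace-decay parameter lambda

    def update(self, x, x_next, reward, gamma, gamma_next, rho):
        """One off-policy update from the transition at time t.

        x, x_next  : feature vectors at times t and t+1
        reward     : pseudo reward r(x_{t+1})
        gamma      : gamma(x_t); gamma_next: gamma(x_{t+1})
        rho        : importance-sampling ratio pi(x_t, a_t) / mu(x_t, a_t)
        Returns the TD error delta_t, which the surprise measure below consumes.
        """
        delta = reward + gamma_next * self.w.dot(x_next) - self.w.dot(x)
        self.e = rho * (x + gamma * self.lam * self.e)
        self.w = self.w + self.alpha * (
            delta * self.e - gamma_next * (1.0 - self.lam) * self.e.dot(self.h) * x_next)
        self.h = self.h + self.beta * (delta * self.e - self.h.dot(x) * x)
        return delta

    def predict(self, x):
        return self.w.dot(x)

A parallel learning system would hold one such learner per GVF and feed every learner the same transition on each update cycle, each with its own pseudo reward, termination function, and importance-sampling ratio.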
Consider a measure of surprise based on the temporal difference error of each GVF,

    Z^{(i)} = δ̄^{(i)} / √( var[δ^{(i)}] ),     (1)

where the bar denotes an exponentially weighted average. This measure of surprise increases when instantaneous errors fall considerably outside the mean error. A curious behavior is any policy that uses a measure of surprise from several GVFs to influence action selection. Here we consider a rule-based curious behavior that uses surprise to determine whether the behavior should continue selecting actions according to target policy π^{(i)} or switch to another target π^{(j)}. Assuming the behavior μ had selected actions according to π^{(i)} for k consecutive steps, the agent decides either to continue following π^{(i)} for k more steps or to switch to a new target policy:

    if Z^{(i)} < τ then
        j = argmax_{j ≠ i} Z^{(j)}
        if Z^{(j)} < τ, pick j randomly
        μ = π^{(j)}
    follow μ for k consecutive steps     (2)

In the experiment reported here, k was set to 120 steps (approximately four seconds), τ was set to 0.2, and the decay rate of the exponential average in Equation 1 was 0.01. Intuitively, we expect a curious behavior to select actions that encourage or facilitate re-learning of an inaccurate GVF.
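The surprise measure (1) and the switching rule (2) can be sketched as follows, again in Python. This is our reading of the equations rather than the authors' code: the absolute value in the numerator, the epsilon guard, the running-variance update, and the function names are our assumptions for a usable sketch.

import numpy as np


class SurpriseTracker:
    """Sketch of the surprise measure in Equation 1: an exponentially weighted
    mean of a GVF's TD error, scaled by the square root of an exponentially
    weighted variance estimate. Absolute value and epsilon are our additions."""

    def __init__(self, decay=0.01, eps=1e-8):
        self.decay = decay   # decay rate of the exponential averages (0.01 in the experiment)
        self.mean = 0.0      # exponentially weighted average of delta
        self.var = 1.0       # exponentially weighted estimate of var[delta]
        self.eps = eps

    def update(self, delta):
        self.mean += self.decay * (delta - self.mean)
        self.var += self.decay * ((delta - self.mean) ** 2 - self.var)
        return self.surprise()

    def surprise(self):
        return abs(self.mean) / np.sqrt(self.var + self.eps)


def choose_target_policy(current_i, surprise, tau=0.2, rng=np.random):
    """Sketch of switching rule (2), applied once every k steps: keep following
    target policy i while it remains surprising; otherwise move to the most
    surprising other GVF, or to a random one if nothing exceeds the threshold."""
    if surprise[current_i] >= tau:
        return current_i                          # still surprising: keep relearning it
    others = [j for j in range(len(surprise)) if j != current_i]
    j = max(others, key=lambda k: surprise[k])    # most surprising alternative
    if surprise[j] < tau:
        j = rng.choice(others)                    # nothing surprising: pick at random
    return int(j)

Each time step, every GVF's TD error would be pushed through its own SurpriseTracker; every k steps the behavior calls choose_target_policy and then follows the chosen target policy for the next k steps.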
In the case of a perturbation that affects a single GVF v^{(i)}, we expect the curious behavior (2) to continually select actions according to the corresponding target policy π^{(i)} until the surprise has subsided, and thus the new situation has been learned.

To investigate this hypothesis, we conducted an experiment in which two GVFs were learned from off-policy experience. The first GVF predicted future discounted battery current draw while counter-clockwise spinning, encoded as π^{(0)}(x, rotate ccw) = 1.0, γ^{(0)}(x) = 0.8 for all x ∈ X, and r^{(0)} = battery current. We used a simple discrete action set of {forward, reverse, stop, rotate cw, rotate ccw}. The second GVF predicted the expected number of time steps until bump if the robot drove forward, encoded as π^{(1)}(x, forward) = 1.0 for all x ∈ X, γ^{(1)}(x) = 0.0 on bump and 0.95 otherwise (a 6.6 second prediction horizon), and r^{(1)} = 1.0.

Each GVF had a unique feature representation. The rotation GVF's binary feature vector was produced by a single tiling of a decaying trace of the observed battery current, with a tile width of 1/16th. The forward GVF used a feature vector constructed from 120x160 web-camera images sampled at 30 frames per second. At the start of the experiment, 100 pixels were selected at random, and for each of these pixels either the luminance or a color channel was selected at random; these selected values were used to construct features from the most recent image. Each value (between 0 and 255) was independently tiled into 16 non-overlapping tiles of width 16, producing a binary feature vector x_t with 16000 components, of which 100 were active on each time step (see Sutton and Barto (1998) for a review of tile coding). The camera was mounted directly on top of the robot, facing forward. Both GVFs were learned using separate instances of GTD(λ = 0.9) with α = 0.05 and β = 0.0005. All learning was done directly on a Raspberry Pi connected to an iRobot Create, with an update cycle of 30 ms.

The perturbation involved putting a heavy load in the back of the robot, which changes the current draw and directly affects the rotation GVF. The drive speed, and thus the forward GVF prediction, is unaffected. Our experiment involved two phases. During the first phase (roughly ten minutes) the robot followed a hand-coded, non-reactive behavior policy that alternated between driving forward until bumping, rotating counter-clockwise in free space (not against the wall), and rotating away from the wall after a bump. Figure 1 shows a visualization of the non-reactive behavior during the first phase of learning. Phase one was interrupted after each GVF had learned to an acceptable level of accuracy, after which time the behavior was switched to the curious behavior policy described in (2). After about two minutes, a 5 pound load was placed in the cargo bay of the Create. The load had a significant effect on the battery current draw, but was not heavy enough to affect the robot's ability to achieve the requested wheel velocity for the drive-forward actions.

Figure 1: Non-reactive control of the Create in its pen.
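The forward GVF's visual feature construction described above can be sketched as follows. This is a simplified illustration, not the authors' code: the helper names are ours, the frame is assumed to be an RGB array, and the luminance option is folded into a plain colour-channel choice.

import numpy as np


def make_pixel_selection(rng, height=120, width=160, n_pixels=100, n_channels=3):
    """Choose once, at the start of the experiment, which pixel and which colour
    channel each visual feature reads from (luminance handling omitted here)."""
    rows = rng.integers(0, height, size=n_pixels)
    cols = rng.integers(0, width, size=n_pixels)
    channels = rng.integers(0, n_channels, size=n_pixels)
    return rows, cols, channels


def tile_code_frame(frame, selection, n_tiles=16, tile_width=16):
    """Tile-code one camera frame into a sparse binary feature vector: each
    selected pixel value (0-255) activates exactly one of n_tiles
    non-overlapping tiles of width tile_width."""
    rows, cols, channels = selection
    values = frame[rows, cols, channels].astype(int)       # the selected pixel values
    tiles = np.minimum(values // tile_width, n_tiles - 1)  # which tile each value falls in
    x = np.zeros(len(rows) * n_tiles)
    x[np.arange(len(rows)) * n_tiles + tiles] = 1.0        # one active bit per selected pixel
    return x


# Example usage on a placeholder frame:
# rng = np.random.default_rng(0)
# selection = make_pixel_selection(rng)
# frame = np.zeros((120, 160, 3), dtype=np.uint8)
# x = tile_code_frame(frame, selection)   # binary feature vector for the forward GVF

On each 30 ms cycle the most recent frame would be encoded this way and fed to the forward GVF's GTD(λ) learner, while the rotation GVF receives its battery-current-trace features.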
Figure 2: A time-series plot of the surprise measure for the Forward (top) and Rotation (bottom) GVFs. Both graphs show the three phases of the experiment: 1) initial learning under the non-reactive behavior, 2) learning under the curious behavior, and 3) learning after the load was added to the robot. Notice that during phase 1 a spike in rotation surprise is present. This was the robot's first rotation just after bumping, which generates a novel current profile. A second spike in surprise occurs after the load is added in phase 3. The latter spike is larger, but reduces faster because the robot effectively modifies its behavior.

Figure 2 shows the effect of the load on each GVF's prediction via the surprise measures. The forward GVF's prediction is largely unaffected by the load; the robot was unsurprised. The rotation GVF's pseudo reward was based on current draw, and was therefore significantly affected, as seen in a substantial increase in surprise. The extra load generated a clear and obvious change in the robot's behavior. The first time the curious behavior selected a counter-clockwise rotation after the load was added, a large increase in surprise was produced. The increased surprise caused an extended counter-clockwise rotation. When neither surprise measure exceeds the threshold, the curious behavior drives forward, only occasionally executing rotations in free space. The effect of the load therefore produced a very distinctive and extended change in behavior. Figure 3 provides two quantitative measures of how the actions were selected before the load was added, during the surprise period, and after relearning. After several seconds of rotation, the surprise for rotating subsided below τ, constant rotation ceased, and the behavior returned to normal operation.
The experiment was repeated several times, varying the lighting conditions, the wall-clock duration of each phase, and the camera orientation. The increase in surprise and the resultant behavior modification were reliably demonstrated each time.

Figure 3: The top figure shows which GVF target policy is active while the robot was under curious behavior control. The bottom histogram shows the number of time steps on which the rotation action was selected over the same time period. The upper figure clearly shows that the robot continually selected actions according to the rotate target policy after the load was added (marked by the red bar), and then returned to normal operation. The lower figure illustrates the same phenomenon, and also shows the variability in the number of rotations before the load was added to the robot.

Discussion

Our experiment was designed to highlight one way in which the off-policy learning progress of multiple GVFs can influence control. We crafted the GVFs and introduced a change that would be immediately detected and would produce a distinctive change in behavior (constant rotation). Although limited, our demonstration is novel and demonstrates a fundamental idea in a scalable way. Surprisingly, this work is the first demonstration of adaptive control of the behavior policy of an off-policy learning system on a robot. All previous GVF learning systems used non-adaptive behaviors (see Sutton et al. 2010, Modayil et al. 2012, White et al. 2013). Typically, intrinsically motivated systems are segregated into two camps: 1) knowledge-driven sequential task learning (Schmidhuber 1991, Oudeyer et al. 2007), and 2) competence-driven option or subtask training (Singh et al. 2005, Simsek and Barto 2006, Schembri et al. 2007). Our approach unifies these two camps. A GVF can be used both to represent a prediction conditioned on a fixed target policy (as used here) and to encode a state-action value function used to learn a policy (learned with Greedy-GQ(λ)). Therefore, adapting behavior based on GVF error drives both predictive knowledge acquisition and improving competence. Finally, our work is the first to highlight and demonstrate the role of surprise and curiosity in a non-stationary setting.

The approach described here is small, but surprisingly scalable. Consider an alternative approach to reacting to change, such as following each target policy in sequence. The robot might be learning thousands of different GVFs conditioned on thousands of target policies, as in previous work (White et al. 2013). It would be infeasible to rely on sequencing target policies to efficiently detect and react to changes when the number of GVFs is large. Assuming any change to the robot's world affects only a subset of the GVFs, our approach to curiosity will provide data to the GVFs that are capable of more learning. A more general approach would be to learn the curious behavior with reinforcement learning and a surprise-based reward. If the behavior were generated via an average-reward actor-critic, then the robot could balance the needs of many GVFs in its action selection without the restriction of following any one target policy. The approach explored here is the first step in exploring the natural synergy between parallel off-policy learning and curious behavior.

Although surprise is useful for adapting to non-stationarity, it can be usefully deployed in a wide range of settings. Imagine a setting where new GVFs are continually created over time. A curious behavior, in a similar way to adapting to a perturbation, can adjust action selection to provide relevant data for the new GVFs. We focused here on using curiosity after initial learning: each GVF had been learned to high accuracy.
Curiosity can also be used during initial learning to avoid the inherent limitations of hand-coded behaviors. Finally, what does a robot do when it is no longer surprised, or bored? Could a curious behavior select actions in such a way as to ready itself to react efficiently to new perturbations or new GVFs? These questions are left to future work.

Conclusion

This paper provides 1) the first measure of surprise based on off-policy GVF learning progress, 2) the first investigation of reactive behavior control with parallel gradient TD learning and function approximation, and 3) the first demonstration of using curiosity-driven control to react to non-stationarity, all on a mobile robot. The ability to determine which off-policy predictions are substantially inaccurate, and to modify robot behavior online to improve learning efficiency, is particularly important in large-scale, parallel, off-policy learning systems.

Acknowledgements

This work was supported by grants from Alberta Innovates Technology Futures and the Natural Sciences and Engineering Research Council of Canada.

References

Baldassarre, G., Mirolli, M. (Eds.). (2013). Intrinsically motivated learning in natural and artificial systems. Berlin: Springer.

Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta.
Modayil, J., White, A., Sutton, R. S. (2012). Multi-timescale nexting in a reinforcement learning robot. In From Animals to Animats 12, 299-309.

Oudeyer, P.-Y., Kaplan, F., Hafner, V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation 11, 265-286.

Schembri, M., Mirolli, M., Baldassarre, G. (2007). Evolving internal reinforcers for an intrinsically motivated reinforcement-learning robot. In Development and Learning, 282-287.

Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proceedings of the 1st International Conference on Simulation of Adaptive Behavior, 222-227.

Simsek, O., Barto, A. G. (2006). An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd International Conference on Machine Learning, 833-840.

Singh, S., Barto, A. G., Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 17, 1281-1288.

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs., Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.

White, A., Modayil, J., Sutton, R. S. (2012). Scaling life-long off-policy learning. In Development and Learning and Epigenetic Robotics, 1-6.