Information Theoretic Approaches for Predictive Models: Results and Analysis

Monica Dinculescu
Supervised by Doina Precup

Abstract

Learning the internal representation of partially observable environments has proven to be a difficult problem. State representations which rely on prior models, such as partially observable Markov decision processes (POMDPs), are computationally expensive and sensitive to the accuracy of the underlying model dynamics. Recent work by Still and Bialek offers an information theoretic approach that compresses the available history into an internal representation and maximizes its predictive power. We propose an alternative algorithm that ensures a more accurate internal representation and faster convergence to the optimal policy. In addition, in order to validate the asymptotic nature of the theoretical algorithm, we present the first empirical results in the field, comparing the internal representations returned by both approaches, their accuracy, and their predictive powers.

1 Introduction

Increasing attention has been given to modelling partially observable systems, where the agent is not allowed to observe the state directly; instead, it receives observations which allow it to infer information about the state. Consider the problem of an agent who interacts with a dynamical system. In this paper, we focus on an active scenario, in which the agent has the capacity to act on the world and influence the observations produced. Active scenarios can be contrasted with passive ones, where the agent simply observes the data produced by the world and constructs an internal representation of the system. Thus, in the active scenario, the challenge becomes 1) deriving the internal representation, given the available data, as well as 2) deciding on an action strategy.

Having the ability to predict the outcome of an action is central to all tasks that a human or artificial agent may encounter. Thus, rather than trying to solve the learning problem, in which the agent tries to discover the system structure by updating model parameters from observation data, we are interested in constructing an internal representation that has maximal predictive power.
That is, the internal representation of the system should be able to make good predictions about future observations, based on the available data.

There have been two predominant approaches to predicting a sequence of observations: partially observable Markov decision processes (POMDPs), and the recently proposed predictive state representations (PSRs) [Littman et al., 2002]. We discuss the advantages and disadvantages of each of these methods, and propose an information theoretic approach based on that developed by [Still and Bialek, 2004]. Our approach differs from the above in that the internal representation considers the information that both the internal state and the action performed by the agent have about the future. Finally, to demonstrate the effectiveness of our approach we consider the example of the float-reset problem, created by [Littman et al., 2002], and compare the predictive powers of the two algorithms.

2 Background

Before discussing the specific details of each of the above approaches, we introduce the basic terminology used in the paper. We define a history $x_{past}$ as a finite length sequence of action-observation pairs, $x_{past} = [a_{-k}o_{-k}, \ldots, a_{-1}o_{-1}]$, where $k$ is the length of the history and $a_io_i$ is the action-observation pair observed at time step $i$ in that history. Similarly, a test $t$ is an ordered sequence of action-observation pairs $t = [a_1o_1, \ldots, a_ko_k]$, extending from the current time step into the future. Finally, a future $y_{fut}$ is an observation $o_t$ that will be observed after taking action $a_t$ at time step $t$.

POMDPs

A partially observable Markov decision process (POMDP) is a general framework for decision making under uncertainty. Formally, a POMDP is defined as a tuple $M = (S, A, O, b_0, T, O)$, where the state set $S$ is the set of states that the system can be in, $A$ is the discrete set of actions, and $O$ is the discrete set of observations. The set $T$ consists of $(n \times n)$ transition matrices $T^a$, where $T^a_{ij}$ is the probability of reaching state $j$ by taking action $a$ in state $i$. The set $O$ consists of diagonal $(n \times n)$ observation matrices $O^{a,o}$, where $O^{a,o}_{ii}$ is the probability of an observation $o$, given that action $a$ was selected in state $i$. Finally, the $(1 \times n)$ vector $b_0$ is an initial probability distribution over the system states [Singh et al., 2004]. Intuitively, the transition function $T$ determines the distribution over next states, given that a certain action was selected in a state, while the observation function $O$ reflects the partially observable nature of the system (the agent cannot determine, with certainty, the true state of the system).

The state representation in a POMDP is a $(1 \times n)$ belief vector $b(h)$, where $b_i(h)$ is the probability of the system being in the hidden state $i$, given that the history $h$ has been observed. The belief state is updated by computing:
$$b(hao) = \frac{b(h)\, T^a O^{a,o}}{b(h)\, T^a O^{a,o}\, e_n^T}$$

where $e_n$ is the $(1 \times n)$ vector of all 1's. Thus, predictions are given by

$$P(t \mid h) = P(a_1o_1 \ldots a_ko_k \mid h) = b(h)\, T^{a_1} O^{a_1,o_1} \cdots T^{a_k} O^{a_k,o_k}\, e_n^T$$
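To make these two formulas concrete, here is a minimal NumPy sketch of belief maintenance and test prediction. It is a sketch only: representing $T$ and $O$ as dictionaries of matrices, and all names, are our own illustrative choices, not part of the POMDP definition above.

```python
import numpy as np

def belief_update(b, T, O, a, o):
    """One step of belief tracking: b(hao) = b(h) T^a O^{a,o}, renormalized."""
    unnormalized = b @ T[a] @ O[(a, o)]        # (1 x n) row vector
    return unnormalized / unnormalized.sum()   # divide by b(h) T^a O^{a,o} e_n^T

def test_probability(b, T, O, test):
    """P(t | h) for a test t = [(a_1, o_1), ..., (a_k, o_k)], given belief b(h)."""
    v = b.copy()
    for a, o in test:                          # chain the T^{a_i} O^{a_i,o_i} factors
        v = v @ T[a] @ O[(a, o)]
    return v.sum()                             # multiply by the all-ones vector e_n^T
```

Here `T[a]` is the $(n \times n)$ transition matrix for action `a`, and `O[(a, o)]` the diagonal observation matrix for the pair $(a, o)$.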
The main disadvantage of the POMDP approach is that, as seen above, estimating the state assumes perfect knowledge of the underlying model. Algorithms that try to learn the dynamics require, at a minimum, an assumption about the topology of the model. In addition, belief state maintenance has, in the worst case, complexity equal to the size of the state space, and exponential in the number of variables. Taken together with the fact that there exist dynamical systems with a finite state space that cannot be modeled by any finite POMDP [Jaeger, 1998], this shows that the capabilities of the POMDP framework are rather limited.

PSRs

Predictive state representations are a recent approach that tries to represent the state of a system as a set of predictions of observable outcomes of experiments that the agent can perform on the system. A set of tests $Q = \{q_1 \ldots q_m\}$ is called a linear predictive state representation (PSR) if, for all histories $h$, the probability of any test occurring can be computed as a linear combination of the predictions for the tests in $Q$. More generally, $Q$ is a linear PSR if, for every test $t$, there exists a $(1 \times m)$ projection vector $m_t$ such that, for all histories $h$, $P(t \mid h) = P(Q \mid h)\, m_t^T$. We call the tests $q_1 \ldots q_m$ in $Q$ the core tests of the PSR [Littman et al., 2002]. In addition, [Singh et al., 2004] define a system dynamics matrix $D$ as an ordering over all possible tests and histories, of all lengths, where each of $D$'s entries is the probability of a test given a history, $P(t \mid h)$. The finitely many linearly independent columns of $D$ correspond to the core tests of the PSR. The matrix is generated by computing the weight vectors $m_t$ from the model parameters, for each test.

A reason for interest in PSRs is that the state representation is constructed only from the actions and observations seen, thus resulting in a less restrictive model. However, empirical results show that learning the PSR weight vectors requires significant amounts of data, making learning algorithms both data and computationally expensive [Singh, Littman et al., 2003].

Active Learning and Optimal Predictions

[Still and Bialek, 2004] recently proposed a solution to the problem specified above. Using principles from information theory, they propose a state representation with the objective of extracting predictive information from a time series. The state representation is constructed as a finite set of states $s$ that should 1) compress the information contained in the past time series and 2) retain maximal predictive power. Intuitively, the better the state representation, the closer its future predictions should be to those given by the available data alone. We want to find a mapping from $x_{past}$ to $s$, such that $s$ is a lossy compression of the past histories with maximal predictive power. Using information theory, this can be formalized as the optimization principle:

$$F = \max_{p(s \mid x_{past})} \left[ I(s; y_{fut}) - \lambda I(s; x_{past}) \right]$$

where $I(X, Y)$ is the mutual information of $X$ relative to $Y$, given by $I(X, Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$.

Taken together with a Markovian assumption, the solution to the above principle is the optimal $x_{past} \rightarrow s$ mapping. This is the Gibbs probability distribution:

$$p(s \mid x_{past}) \propto p(s)\, e^{-\frac{1}{\lambda} D_{KL}[p(y_{fut} \mid x_{past}) \,\|\, p(y_{fut} \mid s)]}$$

where in place of the energy we find the Kullback-Leibler divergence:

$$D_{KL}[p(y_{fut} \mid x_{past}) \,\|\, p(y_{fut} \mid s)] = \sum_{y_{fut}} p(y_{fut} \mid x_{past}) \log \frac{p(y_{fut} \mid x_{past})}{p(y_{fut} \mid s)}$$

It has been proven that, in the limit as $\lambda \rightarrow 0$, the assignment becomes deterministic. Thus, the optimal internal state that a specific history is mapped to is the one that gives a future prediction closest to that given by the history alone (i.e. the one that minimizes the $D_{KL}$ distance between $p(y_{fut} \mid x_{past})$ and $p(y_{fut} \mid s)$):

$$s^*(x_{past}) = \operatorname{argmin}_s D_{KL}[p(y_{fut} \mid x_{past}) \,\|\, p(y_{fut} \mid s)]$$
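In code, this deterministic limit is a nearest-neighbour rule under the KL divergence. A minimal sketch, assuming the predictive distributions are stored as arrays; the smoothing constant and all names are illustrative assumptions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(p || q) for discrete distributions; eps guards against zeros."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def assign_state(p_y_given_x, p_y_given_s):
    """s*(x_past) = argmin_s D_KL[p(y_fut | x_past) || p(y_fut | s)].

    p_y_given_x: length-|Y| distribution for one history.
    p_y_given_s: (|S| x |Y|) array, one row per internal state.
    """
    return int(np.argmin([kl(p_y_given_x, row) for row in p_y_given_s]))
```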
The bane of the above representation is that the actions taken by the agent are independent of the internal state. Since future observations are the result of an action being taken, the intuition is that the internal state, taken together with the action, should in fact be predictive.

3 Proposed Solution

The fundamental idea behind our approach is that actions, taken together with the internal state, are predictive, rather than just the state itself. The motivation behind this approach is that the future produced is solely caused by the action choice; therefore, it is reasonable to expect that predictions that account for the action strategy should be more accurate than those that depend on the state alone. Thus, the above optimization principle can be modified to incorporate the information that the state and action jointly hold about the future. Formally,

$$F = \max_{p(s \mid x_{past})} \left[ I(\{s, a\}; y_{fut}) - \lambda I(s; x_{past}) \right]$$

The first term in the equation ensures that the internal representation maximizes the information held about the future, while the second represents the compression of histories into states. The $x_{past} \rightarrow s$ assignment in this case becomes stochastic. Since the action choices are probabilistic and depend on the optimal action policy (given by $p(a \mid x_{past})$), the state-history mapping will be probabilistic as well, and will depend on the aforementioned policy. By solving the optimization principle, we obtain another Gibbs distribution:

$$p(s \mid x_{past}) \propto e^{-\frac{1}{\lambda} \sum_a p(a \mid x_{past})\, D_{KL}[p(y_{fut} \mid x_{past}, a) \,\|\, p(y_{fut} \mid s, a)]}$$

where

$$p(y_{fut} \mid a, s) = \frac{\sum_{x_{past}} p(y_{fut} \mid x_{past}, a)\, p(a \mid x_{past})\, p(s \mid x_{past})\, p(x_{past})}{\sum_{x_{past}} p(a \mid x_{past})\, p(s \mid x_{past})\, p(x_{past})}$$

Thus, the optimal internal state that a specific history is mapped to is:

$$s^*(x_{past}) = \operatorname{argmin}_s \sum_a p(a \mid x_{past})\, D_{KL}[p(y_{fut} \mid x_{past}, a) \,\|\, p(y_{fut} \mid s, a)]$$

Algorithmic Implementation

Based on the theoretical analysis presented above, we have constructed an iterative algorithm that performs well in any partially observable setting. The algorithm is composed of three parts. First, we simulate the initial estimates of $p(y_{fut} \mid x_{past}, a)$ and $p(x_{past})$ by taking random actions and creating a finite length initial trajectory. Secondly, we iteratively solve the above two equations until $p(s \mid x_{past})$ converges. The result is the optimal $x_{past} \rightarrow s$ assignment, which maps the specific history being observed to an internal state. Finally, an action is taken according to an action policy, producing a new history and observation, which are used to update the $p(y_{fut} \mid x_{past}, a)$ and $p(x_{past})$ distributions. The iterative algorithm improves at every iteration and converges to a local optimum, according to the same arguments that were previously used by [Still and Bialek, 2004].
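The second part of the loop, the self-consistent iteration, can be sketched as follows, under the simplifying assumption that histories, actions, and observations are small discrete sets whose distributions are stored as dense arrays. The array shapes, random initialization, and tolerance are our own choices for the sketch, not part of the derivation:

```python
import numpy as np

def fit_assignment(p_y_xa, p_x, policy, n_states, lam=0.1, n_iters=100, tol=1e-6):
    """Iterate the two self-consistent equations until p(s | x_past) converges.

    p_y_xa : (X, A, Y) array, estimated p(y_fut | x_past, a)
    p_x    : (X,) array, estimated p(x_past)
    policy : (X, A) array, action policy p(a | x_past)
    """
    X, A, Y = p_y_xa.shape
    eps = 1e-12
    rng = np.random.default_rng(0)
    p_s_x = rng.dirichlet(np.ones(n_states), size=X)        # (X, S) random initialization

    for _ in range(n_iters):
        # p(y_fut | s, a): history-conditional predictions averaged with
        # weights p(a|x) p(s|x) p(x), then normalized over y.
        w = policy[:, :, None] * p_s_x[:, None, :] * p_x[:, None, None]        # (X, A, S)
        p_y_sa = np.einsum('xas,xay->say', w, p_y_xa)                          # (S, A, Y)
        p_y_sa /= p_y_sa.sum(axis=-1, keepdims=True) + eps

        # Gibbs assignment:
        # p(s|x) proportional to exp(-(1/lam) sum_a p(a|x) D_KL[p(y|x,a) || p(y|s,a)]).
        logratio = np.log(p_y_xa[:, None] + eps) - np.log(p_y_sa[None] + eps)  # (X, S, A, Y)
        d = np.einsum('xa,xsay->xs', policy, p_y_xa[:, None] * logratio)
        d -= d.min(axis=1, keepdims=True)                   # stabilize the exponentials
        new = np.exp(-d / lam)
        new /= new.sum(axis=1, keepdims=True)

        if np.abs(new - p_s_x).max() < tol:                 # assignment has converged
            p_s_x = new
            break
        p_s_x = new

    return p_s_x, p_y_sa
```

In the full algorithm, this fitting step alternates with acting in the environment and re-estimating $p(y_{fut} \mid x_{past}, a)$ and $p(x_{past})$ from the growing trajectory.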
4 Example: Interactively learning a float-reset system

Description

We describe a very simple partially observable system, created by [Littman et al., 2002]. The system is composed of 5 hidden states, 2 actions and 2 observations. The float action moves to the state on the left/right with uniform probability, and always produces an observation of 0. The reset action moves to the state on the far right. If the system was already in that state, the observation produced is a 1; otherwise, a 0 is observed. Each observed history is of length 3 (i.e. extends 3 time steps into the past).

Figure 1: Float-reset problem
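A minimal simulator for this system follows directly from the description. The state indexing, and clipping the float action at the ends of the chain, are our reading of the informal description; with clipping, a float from the far-right state stays put or moves left with probability .5 each, which matches the analysis of Figure 5 below.

```python
import numpy as np

rng = np.random.default_rng()
N_STATES = 5                    # chain of 5 hidden states; index 4 is the far-right state

def step(state, action):
    """Float-reset dynamics: returns (next_state, observation)."""
    if action == 'f':           # float: move left/right with uniform probability
        state = int(np.clip(state + rng.choice([-1, 1]), 0, N_STATES - 1))
        return state, 0         # float always produces an observation of 0
    else:                       # reset: jump to the far-right state
        obs = 1 if state == N_STATES - 1 else 0
        return N_STATES - 1, obs
```

Random-action trajectories generated this way provide the initial estimates of $p(y_{fut} \mid x_{past}, a)$ and $p(x_{past})$ required by the algorithm.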
We have evaluated both the initial algorithm proposed by Still and Bialek and our proposed algorithm in 4 different scenarios:

- The agent has sufficient knowledge of the past (i.e. the initial distributions are sufficient to capture all the structure in the environment), and the internal representation is an uncompressed mapping of the available histories (the number of internal states s is equal to the number of hidden states in the system, namely s = 5).
- The agent has sufficient knowledge of the past, and the internal representation is a compression of the available histories (the number of internal states s is smaller than the number of hidden states in the system, namely s = 3).
- The agent has insufficient knowledge of the past (i.e. the initial distributions are too sparse and do not capture all the structure in the environment), and the internal representation is an uncompressed mapping of the available histories (s = 5).
- The agent has insufficient knowledge of the past, and the internal representation is a compression of the available histories (s = 3).

Empirical results

Due to space constraints, we will only present the results for the fourth scenario, where the agent has insufficient knowledge of the past and the internal representation is a compression of the available histories. The results for the other three scenarios are consistent with the presented one. We are concerned with determining which algorithm produces the internal representation with the greatest predictive power, and that best compresses the available data.

Optimization Function

To begin, we compare the values of each of the optimization functions over time. This ensures that the trade-off between the predictive power and the compression capacities of the internal representation is maintained.

[Figure 2 appears here: two panels plotting the optimization function F against time, for the proposed algorithm (left) and the Active Learning algorithm (right).]
Figure 2: Optimization function F. The horizontal axis in each plot is the number of iterations of the algorithm.

Notice that the value of the optimization function in our approach increases with the number of iterations. This implies that the accuracy of the internal representation increases as well. Conversely, the internal representation constructed by the original Active Learning algorithm does not vary much. In other words, its predictive power improves very little past its initial value. This is problematic, as a noisy initial distribution will result in a very inaccurate state-history mapping.

Lossy Compression

The internal representation has to be a lossy compression of the past histories. In other words, the predictions given by a state should be very similar to those given by the histories that become assigned to it.

[Figure 3 appears here: two panels plotting $p(y \mid s)$ against $p(y \mid x)$, for the proposed algorithm (left) and the Active Learning algorithm (right).]
Figure 3: History-state mapping. The horizontal axis in each plot contains the possible future observations, namely $y_{fut}$. The vertical axes are $p(y_{fut} \mid x_{past})$, in black, and $p(y_{fut} \mid s)$, in red.

Notice that the number of states used in the internal representation differs between the algorithms. This is because, while the original Active Learning algorithm is deterministic, and thus employs all 3 available internal states, our proposed algorithm is probabilistic.
Since there are two possible observations, there are two different possible predictions, and thus only 2 internal states are needed to represent the system. If the internal representation is to be a compression of the available histories, then the future predictions given by it (the red lines) should best approximate the future predictions given by the entire data (the black lines). As can be seen from the above graph, the future information contained in the available history is better approximated by the 2 internal states constructed by the proposed algorithm than by the 3 states used in the Active Learning algorithm.

Predictive Powers

From the problem definition, we know with certainty that a float action will produce an observation of 0. Consequently, a good internal representation should make the same prediction regardless of previous observations. On the other hand, the observation produced by a reset action depends on the action taken at the previous time step. Specifically, if the previous action was a reset, we know with certainty that the system is in the end state. Thus, taking another reset action is expected to produce an observation of 1. If the previous action was a float, then the probability of observing a 1 is identical to the probability of the system being in the end state. With this in mind, we begin by considering the case when the last action taken is a reset.

[Figure 4 appears here: two panels plotting future predictions over time for the history $x_{past} = [ffr]$, for the proposed algorithm (left) and the Active Learning algorithm (right).]
Figure 4: Future predictions. The horizontal axis in each plot is the number of iterations of the algorithm. The vertical axis differs between the plots. In the case of the proposed algorithm, it is $p(y_{fut} = 0 \mid x_{past} = [ffr], a = \text{float})$, in red, vs. $p(y_{fut} = 0 \mid s^*, a = \text{float})$, in blue, and $p(y_{fut} = 0 \mid x_{past} = [ffr], a = \text{reset})$, in purple, vs. $p(y_{fut} = 0 \mid s^*, a = \text{reset})$, in green. In the case of the original Active Learning algorithm, the vertical axis is $p(y_{fut} = 0 \mid x_{past} = [ffr])$, in red, vs. $p(y_{fut} = 0 \mid s^*)$, in blue. Here, $s^*$ is the optimal state to which the history is mapped.

In the case of the original Active Learning algorithm, the internal representation is independent of the action choice, and thus the future observations depend on the state alone.
Notice that the predictions given by the proposed internal representation converge very quickly to the value of the predictions given by the history alone, thus proving its accuracy. This convergence is faster than that of the original algorithm.

[Figure 5 appears here: two panels plotting future predictions over time for the history $x_{past} = [frf]$, for the proposed algorithm (left) and the Active Learning algorithm (right).]
Figure 5: Future predictions. The horizontal axis in each plot is the number of iterations of the algorithm. The vertical axis differs between the plots. In the case of the proposed algorithm, it is $p(y_{fut} = 0 \mid x_{past} = [frf], a = \text{float})$, in red, vs. $p(y_{fut} = 0 \mid s^*, a = \text{float})$, in blue, and $p(y_{fut} = 0 \mid x_{past} = [frf], a = \text{reset})$, in purple, vs. $p(y_{fut} = 0 \mid s^*, a = \text{reset})$, in green. In the case of the original Active Learning algorithm, the vertical axis is $p(y_{fut} = 0 \mid x_{past} = [frf])$, in red, vs. $p(y_{fut} = 0 \mid s^*)$, in blue. Here, $s^*$ is the optimal state to which the history is mapped.

Since the previous action was a float, the system can be found in either the end state or the one to the left of it, each with a probability of .5. Thus, the probability of seeing an observation of 1 after taking another reset action is also .5. Again, the internal representation constructed by the proposed algorithm predicts future observations that are closer to those predicted from all the available data alone.

Clustering

Finally, we are interested in whether the predictions given by the internal representation are consistent. In other words, whether the same types of histories consistently get mapped together over time.
[Figure 6 appears here: two panels plotting, against time, the histories clustered with $x_{past} = [ffr]$, for the proposed algorithm (left) and the Active Learning algorithm (right).]
Figure 6: History clusters. The horizontal axis in each plot is the number of iterations of the algorithm. The vertical axis represents all the histories of length 3 that get mapped to the same state as $x_{past} = [ffr]$.

[Figure 7 appears here: two panels plotting, against time, the histories clustered with $x_{past} = [fff]$, for the proposed algorithm (left) and the Active Learning algorithm (right).]
Figure 7: History clusters. The horizontal axis in each plot is the number of iterations of the algorithm. The vertical axis represents all the histories of length 3 that get mapped to the same state as $x_{past} = [fff]$.

Using the same argument as in the lossy compression case, we expect to see two different clusters being formed by the proposed algorithm. All the histories that are mapped to the same state as $x_{past} = [ffr]$ are, in fact, histories that end in either an $r0$ or $r1$ action-observation pair. This is because the future prediction based on each of these histories is known with certainty. Consequently, the remaining histories will be mapped to the second state, as the future prediction in each of these cases is probabilistic. If these histories were to be mapped to two different states, rather than one, it might be possible to tell apart small probability differences in the case of the reset action. However, it is unrealistic to assume that this could happen, as the agent is learning from noisy and insufficient data. This, however, is not the case when considering the original Active Learning algorithm. Since the state-history mapping is deterministic, in order to retain good predictive powers, the states need to be more spread out among the possible histories.
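The clusters in Figures 6 and 7 can be read directly off the converged assignment: each length-3 history goes to its most probable internal state. A small sketch, where the `histories` list and the `p_s_x` array from the fitting sketch above are assumed inputs:

```python
from collections import defaultdict

def history_clusters(histories, p_s_x):
    """Group histories by the internal state with the highest assignment probability."""
    clusters = defaultdict(list)
    for h, dist in zip(histories, p_s_x):
        clusters[int(dist.argmax())].append(h)
    return clusters
```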
5 Conclusion

We have presented an information theoretic approach to the problem of predicting future observations from limited available data, by proposing that the internal representation, together with the action strategy employed by the agent, should retain maximal predictive power and compress the past histories seen by an agent. We have compared our proposed algorithm to that of [Still and Bialek, 2004] in four different tasks. The experimental results illustrate an increase in the predictive power of the internal representation, as well as a faster convergence to the optimal state-history mapping. These experiments suggest that our approach of considering actions in creating the internal state representation is a viable solution for predicting partially observable systems using a limited amount of data. Future work will analyze further problems that can be addressed by our approach, as well as compare our internal representation of the world to that constructed by a PSR.

References

[Still and Bialek, 2004] Susanne Still and William Bialek. Active learning and optimal predictions. 2004.

[Littman et al., 2002] Michael Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In NIPS 14, pages 1555-1561, 2002.

[Singh et al., 2004] Satinder Singh, Michael R. James, and Matthew R. Rudary. Predictive state representations: A new theory for modelling dynamical systems. In Proceedings of UAI, pages 512-519, 2004.

[Jaeger, 1998] Herbert Jaeger. Discrete-time, discrete-valued observable operator models: a tutorial. GMD Report 42, 1998.

[Singh, Littman et al., 2003] Satinder Singh, Michael L. Littman, Nicholas K. Jong, David Pardoe, and Peter Stone. Learning predictive state representations. In The Twentieth International Conference on Machine Learning (ICML-2003), 2003.

[Lovejoy, 1991] William S. Lovejoy. A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, pages 47-65, 1991.

[Tishby et al., 1999] Naftali Tishby, Fernando Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference, pages 368-377, 1999.