AsistO: A Qualitative MDP-based Recommender System for Power Plant Operation
AsistO: Un Sistema de Recomendaciones basado en MDPs Cualitativos para la Operación de Plantas Generadoras

Alberto Reyes 1, L. Enrique Sucar 2 and Eduardo F. Morales 2
1 Instituto de Investigaciones Eléctricas; Av. Reforma 113, Palmira, Cuernavaca, Morelos, 62490, México; areyes@iie.org.mx
2 INAOE; Luis Enrique Erro 1, Sta. Ma. Tonantzintla, Puebla 72840, México; {esucar, emorales}@inaoep.mx

Article received on July 15, 2008; accepted on April 03, 2009

Abstract
This paper proposes a novel and practical model-based learning approach with iterative refinement for solving continuous (and hybrid) Markov decision processes. Initially, an approximate model is learned using conventional sampling methods and solved to obtain a policy. Iteratively, the approximate model is refined using the variance in the utility values as the partition criterion. In the learning phase, initial reward and transition functions are obtained by sampling the state-action space. The samples are used to induce a decision tree predicting reward values, from which an initial partition of the state space is built. The samples are also used to induce a factored MDP. The state abstraction is then refined by splitting states only where the split is locally important. The main contributions of this paper are the use of sampling to construct an abstraction, and a local refinement process of the state abstraction based on utility variance. The proposed technique was tested in AsistO, an intelligent recommender system for power plant operation, where we solved two versions of a complex hybrid continuous-discrete problem. We show how our technique approximates a solution even in cases where standard methods explode computationally.
Keywords: Recommender systems, power plants, Markov decision processes, abstractions.

Resumen
Este artículo propone una técnica novedosa y práctica de aprendizaje basada en modelos con refinamiento iterativo para resolver procesos de decisión de Markov (MDPs) continuos. Inicialmente, se aprende un modelo aproximado usando métodos de muestreo convencionales, el cual se resuelve para obtener una política. Iterativamente, el modelo aproximado se refina con base en la varianza de los valores de la utilidad esperada. En la fase de aprendizaje, se obtienen las funciones de recompensa inmediata y de transición mediante muestras del tipo estado-acción. Éstas primero se usan para inducir un árbol de decisión que predice los valores de recompensa y a partir del cual se construye una partición inicial del espacio de estados. Posteriormente, las muestras también se usan para inducir un MDP factorizado. Finalmente, la abstracción del espacio de estados resultante se refina dividiendo aquellos estados donde pueda haber cambios en la política. Las contribuciones principales de este trabajo son el uso de datos para construir una abstracción inicial, y el proceso de refinamiento local basado en la varianza de la utilidad. La técnica propuesta fue probada en AsistO, un sistema inteligente de recomendaciones para la operación de plantas generadoras de electricidad, donde resolvimos dos versiones de un problema complejo con variables híbridas continuas y discretas. Aquí mostramos cómo nuestra técnica aproxima una solución aun en casos donde los métodos estándar explotan computacionalmente.
Palabras clave: Sistemas de recomendaciones, plantas generadoras, procesos de decisión de Markov, abstracciones.

1 Introduction

Markov Decision Processes (MDPs) [18] have developed as a standard method for decision-theoretic planning. Traditional MDP solution techniques have the drawback that they require an explicit state representation, limiting their applicability to real-world problems.
Factored representations [6] help to address this drawback by compactly specifying state spaces in factored form using dynamic Bayesian networks or decision diagrams. Given that
algorithms for planning with MDPs still run in time polynomial in the size of the state space, they do not guarantee that a factored model of a high-dimensional domain will be solved efficiently. Abstraction and aggregation methods give us the tools to deal with these difficulties so that planning in real-world problems can become tractable. However, these techniques generally apply only to problems with discrete state and action spaces. The problem with continuous MDPs (CMDPs) is that if the continuous space is discretized to find a solution, the discretization causes yet another level of exponential blow-up. This curse of dimensionality has limited the use of the MDP framework, and overcoming it has become a relevant topic of research. Two recent methods to solve CMDPs are grid-based MDP discretizations and parametric approximations. The idea behind grid-based MDP discretizations is to discretize the state space into a set of grid points and approximate value functions over such points. Unfortunately, classic grid algorithms scale up exponentially with the number of state variables [5]. An alternative way to solve a continuous-state MDP is to approximate the optimal value function V(s) with an appropriate parametric function model [4]. The parameters of the model are fitted iteratively by applying one-step Bellman backups to a finite set of state points arranged on a fixed grid or obtained through Monte Carlo sampling. A least-squares criterion is used to fit the parameters of the model. In addition to parallel updates and optimizations, on-line update schemes based on gradient descent [4] can be used to optimize the parameters. The disadvantages of these methods are their instability and possible divergence [3]. Several authors, e.g., [17], use the notions of abstraction and aggregation to group states that are similar with respect to certain problem characteristics, further reducing the complexity of the representation or the solution. Feng [11] proposes a state aggregation approach for exploiting the structure of MDPs with continuous variables. The state space is dynamically partitioned into regions where the value function is the same throughout each region. Li et al. [15] address hybrid state spaces using a discretization-free approach called lazy approximation, and present a comparison with Feng's work, finding that their method produced reasonable and consistent results in a more complex version of the planet rover domain (also used by Feng). Hauskrecht [13] shows that approximate linear programming is able to solve factored continuous MDPs. Similarly, Guestrin [12] presents a framework to model and solve factored MDPs for both discrete and continuous problems in collaborative settings. Our approach is related to this work; however, it differs in several aspects. First, it is based on qualitative models, which are particularly useful for domains with continuous state variables. It also differs in the way in which the abstraction is built. We use training data to learn a decision tree for the reward function, from which we deduce an abstraction called qualitative states. There has been other work on variable-resolution grids [16, 7]; however, most of it starts from a uniform grid. The idea of refining an initial abstraction for discrete state spaces has also been suggested in [1]; however, we introduce a different refinement criterion. The initial abstraction is refined and improved via a local iterative process. States with high variance in their value with respect to neighboring states are partitioned, and the MDP is solved locally to improve the policy.
At each stage in the refinement process, only one state is partitioned, and the process finishes when no potential partition changes the policy. In our approach, the reward function and transition model are learned from a random exploration of the environment, and the method works with purely continuous spaces as well as with hybrid spaces containing both continuous and discrete variables. Algorithms such as Dyna-Q or prioritized sweeping (e.g., see [21]) from the reinforcement learning community have been used to learn a transition model while exploring the environment. In contrast to these and other previous approaches, our method automatically learns both an abstraction and a model by simply sampling the environment. This abstraction is iteratively refined based on local information, making the refinement very efficient. Thus, our method is, on the one hand, simpler than other abstraction and refinement approaches; and on the other hand, it builds the model and the abstraction automatically. The main contributions are the use of sampling to construct an abstraction, and a local refinement of the initial abstraction based on utility variance. We have tested our method on a high-dimensional problem in the power plant domain, in which the state space can be either continuous or hybrid continuous-discrete. We show how our technique approximates a solution even in cases where standard methods explode computationally. The rest of the paper is organized as follows. The next section describes our domain of interest and the associated planning problem. Section 3 gives a brief introduction to MDPs and their factored representation. Section 4 develops the abstraction process and a procedure to learn such an abstraction from data. Section 5 explains the
refinement stage. Section 6 presents AsistO, a recommender system for power plant operation, which implements the notion of qualitative MDPs in its planning subsystem; the empirical evaluation is also described there. We conclude with a summary and directions for future work.

2 Application Domain

Our domain of interest lies in the steam generation system of a combined-cycle power plant. This system, which is aimed at providing superheated steam to a steam turbine, is basically composed of a heat recovery steam generator, a recirculation pump, control valves and interconnection pipes. A heat recovery steam generator (HRSG) is a piece of process machinery capable of recovering residual energy from the exhaust gases of a gas turbine to generate high-pressure steam (Pd) in a special tank (the steam drum). The recirculation pump is a device that extracts residual water from the steam drum to maintain the water supply to the HRSG (Ffw). The result of this process is a high-pressure steam flow (Fms) that keeps a steam turbine running to produce electric energy (g) in a power generator. The main associated control elements are the feed-water valve (fwv) and the main steam valve (msv). The complete process control domain is shown in figure 1. During normal operation, a three-element feed-water control system (ecs) commands the feed-water control valve (fwv) to regulate the level (dl) and pressure (Pd) in the drum. However, this traditional controller does not consider the possibility of failures in the control loop (valves, instrumentation, or any other process devices). Furthermore, it ignores whether the outcomes of executing a decision will help in the future to increase the steam drum lifetime, security, and productivity. So, the problem is to obtain a function that maps plant states to recommendations while considering all these aspects. Under the MDP framework, the potential failures are considered implicitly in the transition function, and the security and productivity goals are included in the reward. Thus, MDPs provide an adequate model for this problem; however, standard solutions explode computationally and cannot deal with continuous variables. Next we give a brief review of MDPs, and then we present our method for solving continuous and complex MDPs, as required for the power plant domain.

Fig. 1. A simplified diagram of the steam generation process. Aimed at providing superheated steam to a turbine, the steam generation system is basically composed of a heat recovery steam generator, a recirculation pump, control valves and interconnection pipes
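As a quick reference for the variables introduced above, the following is a minimal sketch of how the plant state and the operator commands could be represented; the grouping, types and action names are our own illustrative assumptions, not part of the plant model itself.

```python
# Illustrative grouping of the process variables named in this section.
# Types and the action list are assumptions made only for this sketch.

from dataclasses import dataclass

@dataclass
class PlantState:
    Fms: float  # main steam flow
    Ffw: float  # feed-water flow
    Pd: float   # drum pressure
    g: float    # generated electric power
    dl: float   # drum level
    msv: float  # main steam valve position (control element)
    fwv: float  # feed-water valve position (control element)

# Hypothetical operator commands acting on the two control valves
ACTIONS = ["open fwv", "close fwv", "open msv", "close msv", "null action"]
```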
3 Factored Markov Decision Processes

A Markov decision process (MDP) [18] models a sequential decision problem in which a system evolves in time and is controlled by an agent. The system dynamics is governed by a probabilistic transition function Φ that maps states S and actions A to new states S. At each time step, the agent receives a reward R that depends on the current state s and the applied action a. Thus, solving an MDP amounts to finding a recommendation strategy, or policy, that maximizes the expected reward over time while dealing with the uncertainty in the effects of the actions. Formally, an MDP is a tuple M = <S, A, Φ, R>, where S is a finite set of states {s_1, ..., s_n}; A is a finite set of actions for all states; Φ : A × S × S → [0, 1] is the state transition function, specified as a probability distribution, where the probability of reaching state s' by performing action a in state s is written Φ(a, s, s'); and R : S × A → ℝ is the reward function, where R(s, a) is the reward that the agent receives if it takes action a in state s. For the discrete discounted infinite-horizon case with any given discount factor γ, there is a policy π that is optimal regardless of the starting state and that satisfies the Bellman equation [2]:

$V^{\pi}(s) = \max_{a} \left\{ R(s,a) + \gamma \sum_{s' \in S} \Phi(a,s,s')\, V^{\pi}(s') \right\}$   (1)

In continuous Markov decision processes (CMDPs) the optimal value function satisfies the Bellman fixed-point equation:

$V(s) = \max_{a} \left[ R(s,a) + \gamma \int_{s'} \Phi(a,s,s')\, V(s')\, ds' \right]$   (2)

Two methods for solving these equations and finding an optimal policy for an MDP are: (a) dynamic programming [18] and (b) linear programming. In a factored MDP, the set of states is described via a set of random variables X = {X_1, ..., X_n}, where each X_i takes on values in some finite domain Dom(X_i). A state s defines a value x_i ∈ Dom(X_i) for each variable X_i. The transition model can be exponentially large if it is explicitly represented as matrices; however, the frameworks of dynamic Bayesian networks (DBNs) [10] and decision trees [19] give us the tools to describe the transition model and the reward function concisely.

Fig. 2. A simple DBN with 5 state variables for one action (left). Influence diagram denoting a reward function (center). Structured conditional reward (CR) represented as a binary decision tree (right)
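Returning to equation (1), the following is a minimal value iteration sketch over a small abstract state space; the states, actions, transition probabilities and rewards are illustrative assumptions, not the plant model used later in the paper.

```python
# Minimal value iteration sketch for a discrete (abstract) MDP, following equation (1).
# States, actions, Phi and R below are illustrative placeholders only.

GAMMA = 0.9
states = ["q0", "q1", "q2"]
actions = ["open_valve", "close_valve", "null"]

# Phi[a][s][s2]: transition probability; R[s][a]: immediate reward (hypothetical values)
Phi = {a: {s: {s2: 1.0 / len(states) for s2 in states} for s in states} for a in actions}
R = {s: {a: (1.0 if s == "q1" else 0.0) for a in actions} for s in states}

V = {s: 0.0 for s in states}
for _ in range(100):  # synchronous Bellman backups until (approximate) convergence
    V = {s: max(R[s][a] + GAMMA * sum(Phi[a][s][s2] * V[s2] for s2 in states)
                for a in actions)
         for s in states}

# Greedy policy extraction: the action maximizing the one-step lookahead value
policy = {s: max(actions,
                 key=lambda a: R[s][a] + GAMMA * sum(Phi[a][s][s2] * V[s2] for s2 in states))
          for s in states}
print(policy)
```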
Let X_i denote a variable at the current time and X'_i the same variable at the next step. The transition graph of a DBN is a two-layer directed acyclic graph G_T whose nodes are {X_1, ..., X_n, X'_1, ..., X'_n}, see figure 2 (left). Each node X'_i is associated with a conditional probability distribution (CPD) P_Φ(X'_i | Parents(X'_i)), which is usually represented by a matrix (conditional probability table) or, more compactly, by a decision tree. The transition probability Φ(a, s, s') is then defined to be Π_i P_Φ(x'_i | u_i), where u_i represents the values of the variables in Parents(X'_i). The next value X'_i often depends on a small subset of variables (Parents(X'_i)), simplifying the transition function. The reward associated with a state often depends only on the values of certain features of the state. The relationship between rewards and state variables can be represented with value nodes in influence diagrams, as shown in figure 2 (center). The conditional reward table (CRT) for such a node is a table that associates a reward with every combination of values of its parents in the graph. This table is locally exponential in the number of relevant variables. Although in the worst case the CRT will take exponential space to store the reward function, in many cases the reward function exhibits structure allowing it to be represented compactly using decision trees or graphs, as shown in figure 2 (right).

4 Qualitative MDPs

Although factored MDPs provide important reductions in the representation of transition and reward functions, in high-dimensional problems there can still be a large number of states involved. On the other hand, defining a suitable partition of the state space by hand is not an easy task for a human expert. In this paper, we propose a novel approach to automatically define abstract states, and a procedure to approximate a decision model from data. In the proposed method, we gather information about the rewards and the dynamics of the system by exploring the environment. This information is used to build a decision tree [20] representing a small set of abstract states (called the qualitative partition) with equivalent rewards, and then to learn a probabilistic transition function using a Bayesian network learning algorithm [9]. The resulting approximate MDP model can be solved using traditional dynamic programming algorithms.

4.1. Qualitative states

A qualitative state¹ (or q-state), q_i, is a set of states (or a partition of the state space in the continuous case) that share similar immediate rewards. A qualitative state space, Q, is a set of q-states, q_1, q_2, ..., q_n, also called the qualitative partition. Similarly to the reward function in a factored MDP, the qualitative constraints that distinguish regions of the state space with different reward values can be represented by a decision tree called the reward decision tree (RDT). Since a qualitative state maps directly to a reward value, a qualitative partition Q can also be represented by a binary decision tree (Q-tree). In order to obtain a Q-tree, a reward decision tree (RDT) is first induced from simulated data and then transformed by simply renaming the reward values to q-state labels. Each leaf in the Q-tree is labeled with a new qualitative state. Even for leaves with the same reward value, we assign a different qualitative state value. This produces more states but at the same time creates more guidance that helps to produce more adequate policies. Figure 3 illustrates this tree transformation for a simple two-dimensional case that represents a temperature-volume diagram for an ideal gas.
¹ Although other authors have used the term qualitative in a temporal sense, this work refers to qualitative in a relational, spatial sense.
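To make the RDT-to-Q-tree transformation concrete, the following is a minimal sketch of a Q-tree as nested threshold tests that map a continuous reading to a q-state label. The bounds Temp > 306 and Vol > 48 follow the ideal-gas example of figure 4; the overall tree shape and the extra temperature bound are our own assumptions, not the tree induced in the paper.

```python
# Sketch of a Q-tree lookup for the ideal-gas example: internal nodes test continuous
# variables against bounds, and each leaf is a q-state. Tree shape is illustrative.

def q_state(temp: float, vol: float) -> str:
    """Return the qualitative state for a (Temperature, Volume) reading."""
    if temp > 306.0:
        return "q4" if vol > 48.0 else "q3"
    if vol > 48.0:
        return "q2"
    # Hypothetical extra bound so the tree has five leaves, as in figure 4
    return "q1" if temp > 290.0 else "q0"

print(q_state(310.0, 50.0))  # -> "q4"
```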
Fig. 3. Transformation of the reward decision tree (left) into a Q-tree (right). Internal nodes in both trees represent continuous variables, and edges evaluate whether the variable is less than or greater than a particular bound. Leaf nodes in the RDT represent rewards, and in the Q-tree they are q-states

Each branch in the Q-tree denotes the set of constraints of a q-state, q_i, that bounds a continuous region. For example, a qualitative state could be a region in a temperature-volume diagram bounded by the constraints: Temp > 306 and Vol > 48. Figure 4 illustrates the constraints associated with the example presented above, and its representation in a 2-dimensional space. It is evident that a qualitative state can cover a large number of states (if we consider a fine discretization) with similar properties.

Fig. 4. In a Q-tree (left), branches are constraints and leaves are qualitative states. A graphical representation of the tree is also shown (right). Note that when an upper or lower variable bound is infinite, it must be understood as the upper or lower bound of the variable in the domain

4.2. Qualitative MDP Model Specification

We can define a qualitative MDP as an MDP with a qualitative state space. A hybrid (or qualitative-discrete) MDP is a factored MDP with a set of qualitative and discrete factors. In this case, we have a set of discrete variables, and the qualitative state space Q, which is an additional factor that concentrates all the continuous variables. Initially, only the continuous variables involved in the reward function are considered in the learning algorithm. Other continuous variables are discretized arbitrarily; however, this initial discretization is improved in the refinement stage,
as described in Section 5. Thus, a hybrid qualitative-discrete state is described in factored form as s_h = {X_1, ..., X_n, Q}, where X_1, ..., X_n are the discrete factors and Q is a factor that represents the continuous dimensions relevant to the reward function.

4.3. Learning Qualitative MDPs

The qualitative MDP model is learned from data based on a random exploration of the environment that records state transitions, the actions taken, and the associated reward values. To better understand how a training data set is recorded, consider the two-dimensional domain described above, but now assuming that the system state can be modified by changing the temperature and volume values. The possible actions are increase/decrease the temperature, increase/decrease the volume, and do nothing (the null action). Figure 5 shows graphically a possible data trace produced by the random application of different actions on the system. Each dot in the figure represents a particular state (volume and temperature) that results after the application of a particular action. Each state is also associated with a reward value, which corresponds to the different regions in figure 5. Thus, after exploring the environment we obtain a data set that records, sequentially from t = 1 to N, the action, the resulting state and the reward. So, for the gas example, each data record will contain: Data = (Temperature, Volume, Action, Reward). From this data set, a decision model is obtained and then solved using the value iteration algorithm. Formally, this idea can be described as follows. Given a set of state transitions represented as a set of random variables, O^j = {X_t, A, X_{t+1}}, for j = 1, 2, ..., N, for each state and action A executed by an agent, and a reward (or cost) R^j associated with each transition, we learn a qualitative factored MDP model:

1. From the set of sampled transitions {O, R}, induce a reward decision tree, RDT, that predicts the reward function R in terms of the continuous and discrete state variables, X_1, ..., X_k, Q. For the gas example, this tree corresponds to the one shown in figure 3, left.
2. Obtain from the decision tree (RDT) the set of constraints on the continuous variables that are relevant to determine the qualitative states (q-states), in the form of a Q-tree. In terms of the domain variables, we obtain a new variable Q representing the reward-based qualitative state space whose values are the q-states. This transformation is illustrated in figure 3 for the ideal gas example, with the resulting Q-tree (right). This Q-tree is shown again in figure 4 (left), which also shows the qualitative partition obtained (right), where the state space is divided into 5 qualitative states, q0, q1, ..., q4.
3. Qualify the data from the original sample in such a way that the new set of attributes is the Q variable, the remaining discrete and continuous variables not included in the decision tree, and the action A. The continuous variables not considered in the RDT are discretized in a coarse way with equal-size intervals (this initial discretization is improved in the refinement stage). This transformed data set is called the qualified data set. For the example, the state in each record of the data set will be represented by the corresponding qualitative state, q0, ..., q4, instead of the numeric values of the original state variables, Vol. and Temp. These q-states are determined in terms of the partition of the state space, as shown in figure 4.
4. Format the qualified data set in such a way that the attributes follow a temporal causal ordering. For example, variable Q_t must be set before Q_{t+1}, X1_t before X1_{t+1}, and so on.
The whole set of attributes should be the variable Q at time t, the remaining system variables, X_1, ..., X_k, at time t, the variable Q at time t + 1, the remaining system variables at time t + 1, and the action A. Thus, for the gas example, each record in the qualified data set will be: (q, a, r)_t, where q is the q-state, a is the action, r is the reward, and t is the time, from t = 0 to t = N (N is the number of steps in the exploration).
5. Prepare the data for the induction of a 2-stage dynamic Bayesian network. According to the dimension of the action space, split the qualified data set into |A| sets of samples, one for each action. In the gas case there will be 5 sets, one for each possible action: increase/decrease the temperature, increase/decrease the volume, and do nothing.
6. Induce the transition model for each action, A_j, using a Bayesian network learning algorithm [9]. So, for our running example, we will induce a DBN to represent the transition model for each of the 5 actions, all in terms of the q-state variables (a minimal code sketch of steps 3-5 is given at the end of this section).

Fig. 5. Exploration trace for the ideal gas domain. Each dot in the figure represents a data point in the exploration, with its corresponding state (Vol. and Temp.), reward (determined by the region), and the action applied to reach this state. Thus, by applying random actions on the system, it is possible to capture the effects of these actions (new states) and the immediate reward received per state

At the end of this process we have learned a qualitative MDP model of the problem based on a random exploration of the environment, and the qualitative partition obtained from the reward decision tree. In this model, the transition function is represented as a set of 2-stage DBNs, one per action, and the reward by a decision tree; both are expressed in terms of the q-state variables. As mentioned before, if there are additional variables that are not part of the reward function, these are simply incorporated into the model. This initial model represents a high-level abstraction of the continuous state space and can be solved efficiently using a standard technique, such as value iteration, to obtain the optimal policy. For instance, in the ideal gas example, the resulting policy will give the optimal action for each q-state, q0, ..., q4. This approach has been successfully applied in several domains; however, in some cases the initial abstraction can miss relevant details of the domain and consequently produce sub-optimal policies. We improve this initial partition through a refinement stage, described in the next section.
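As promised above, the following sketch illustrates steps 3-5 of the learning procedure on the ideal-gas data: qualify the raw samples with the Q-tree, order the attributes temporally, and split the qualified records per action. The sample format, thresholds and helper names are our own illustrative choices, and the DBN induction of step 6 (K2 in the paper) is only indicated as a placeholder.

```python
# Sketch of steps 3-5: qualify exploration samples and split them by action before
# inducing one 2-stage DBN per action. Thresholds and data are illustrative only.

from collections import defaultdict

def q_state(temp, vol):
    # Q-tree lookup with the illustrative bounds of figure 4
    if temp > 306.0:
        return "q4" if vol > 48.0 else "q3"
    return "q2" if vol > 48.0 else "q0"

# Raw exploration samples: (Temp_t, Vol_t, action, Temp_t+1, Vol_t+1, reward)
samples = [(300.0, 40.0, "inc_temp", 310.0, 40.0, 0.0),
           (310.0, 40.0, "inc_vol", 310.0, 50.0, 1.0)]

# Steps 3-4: qualified, temporally ordered records (q_t, action, q_t+1)
qualified = [(q_state(t0, v0), a, q_state(t1, v1)) for t0, v0, a, t1, v1, _ in samples]

# Step 5: one training set per action
per_action = defaultdict(list)
for q_t, a, q_t1 in qualified:
    per_action[a].append((q_t, q_t1))

# Step 6 (placeholder): induce a 2-stage DBN per action from per_action[a],
# e.g. with a Bayesian network learning algorithm such as K2.
print(dict(per_action))
```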
5 Qualitative State Refinement

We have designed a value-based algorithm that recursively selects and partitions abstract states with high utility variance. If there are continuous dimensions that were not included in the initial Q-tree (because they do not affect the reward), these are incorporated at this stage. For this, we simply extend the Q-tree with the additional dimensions using an initial, coarse discretization. Before describing the refinement algorithm in detail, we need to define some relevant concepts.

The border of a state s is defined as the set of states S_b = {s_1, ..., s_j} such that each s_k in S_b is a neighbor of s; that is, they are adjacent in at least one dimension. A region is defined as r = s ∪ S_b, that is, a state and its border states. For instance, in the ideal gas example, q_1, q_0 and q_4 are the border states of q_3, so r_{q_3} = {q_3, q_1, q_0, q_4}; see figure 4. The utility variance of the region r that corresponds to state s is defined as:

$\sigma_r^2 = \frac{1}{n} \sum_{k=1}^{n} \left( V_{q_k} - \bar{V}_r \right)^2$   (3)

where n is the number of border states of s, V_{q_k} is the value of each state q_k in the region, and $\bar{V}_r$ is the average value of the states in the region. The value of each state is obtained when we solve the qualitative MDP, as described in the previous section. The utility gradient gives the difference in utility between one state, s, and one of its border states, s_k, and is defined as follows:

$\delta_k = V_s - V_{s_k}$   (4)

The hyper-volume of a state s corresponds to the space occupied by the state and is obtained as the product over its d dimensions:

$hv = \prod_{l=1}^{d} x_l$   (5)

where x_l is the size of the state along dimension l. The refinement algorithm takes as input the initial qualitative partition obtained in the learning stage and an initial solution of this qualitative MDP. It also requires a minimum hyper-volume for a state, defined by the user, as this depends on the application. It proceeds as follows:

1. Initialize all the states as unmarked.
2. While there is an unmarked qualitative state greater than the minimum hyper-volume:
(a) Save a copy of the previous MDP (before the partition) and its solution.
(b) Obtain the utility variance of each state in its corresponding region.
(c) Select the qualitative state with the highest variance in its utility value with respect to its neighbors; name it q.
(d) For the qualitative state q, select a continuous dimension to split on, from (x_0, x_1, ..., x_n), such that it has the highest utility gradient with respect to its border states along this dimension.
(e) Bisect the q-state q over the selected dimension (divide the state in two).
(f) Solve the new MDP, which includes the new partition, using value iteration.
(g) If the new MDP has the same policy as before, mark the original state q (before the partition) and return to the previous MDP; otherwise, accept the refinement and continue.
3. Return the final partition and its solution.
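For readers who prefer code, the following is a minimal sketch of the quantities in equations (3)-(5) and of the bisection in step (e); the data structures (dictionaries keyed by q-state name) and the numeric values are our own illustrative choices.

```python
# Helper sketches for the refinement algorithm. V holds the value of each q-state
# (from solving the qualitative MDP), neighbors lists border states, and bounds gives
# the interval covered by each q-state in every continuous dimension. Values are
# illustrative only.

from typing import Dict, List, Tuple

V: Dict[str, float] = {"q0": 4.2, "q1": 1.0, "q2": 0.8, "q3": 0.9, "q4": 3.5}
neighbors: Dict[str, List[str]] = {"q0": ["q1", "q2", "q3", "q4"]}
bounds: Dict[str, Dict[str, Tuple[float, float]]] = {
    "q0": {"Temp": (290.0, 306.0), "Vol": (0.0, 48.0)},
}

def utility_variance(q: str) -> float:
    """Equation (3): variance of the border-state values around the region's mean."""
    border = neighbors[q]
    region = [q] + border
    mean_r = sum(V[s] for s in region) / len(region)
    return sum((V[s] - mean_r) ** 2 for s in border) / len(border)

def utility_gradient(q: str, qk: str) -> float:
    """Equation (4): difference in utility between a state and one border state."""
    return V[q] - V[qk]

def hyper_volume(q: str) -> float:
    """Equation (5): product of the interval lengths over all continuous dimensions."""
    hv = 1.0
    for lo, hi in bounds[q].values():
        hv *= (hi - lo)
    return hv

def bisect_state(q: str, dim: str):
    """Step (e): split the state in two halves along the chosen dimension."""
    lo, hi = bounds[q][dim]
    mid = (lo + hi) / 2.0
    left, right = dict(bounds[q]), dict(bounds[q])
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return left, right

print(utility_variance("q0"), hyper_volume("q0"), bisect_state("q0", "Vol"))
```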
The refinement process is now described for the ideal gas example. Figure 6 illustrates 3 steps in the abstraction process for the example in figure 4. The initial partition is shown at the top left. Let us assume that state q0 has the highest variance in utility with respect to its neighbors, q1, q2, q3, q4, and that Vol. is the dimension with the highest difference in utility. A bisection is then inserted to split state q0 into the new states q0 and q1 (Step 1, top right). The remaining states are relabeled to preserve a progressive numbering. After solving the new MDP and verifying that the policy has changed, the bisection is accepted and the algorithm proceeds to Step 2 (bottom left). In this case q1 is the state with the highest variance, and it is split on the Temp. dimension, which is the dimension with the highest difference in utility. However, after solving the new MDP, the policy does not change, so the division is canceled and the algorithm returns to the previous partition, as depicted in the bottom right of figure 6. Thus, this state will be marked and not considered for subsequent partitions.

Fig. 6. An example of the qualitative refinement process for a two-dimensional state space. Initial partition: the initial solution obtained before; for each q-state its value and optimal action are shown. Step 1: the state q0 with the highest variance is bisected along the dimension with the highest variance, Vol. Note that the q-states have been renamed. Step 2: now q1 is partitioned along the Temp. dimension. Step 3: as there is no change in policy for the partition in Step 2, it returns to the partition in Step 1

Next we describe how the qualitative MDP approach was applied in the power plant domain.
6 AsistO: A Recommender System for Power Plants

AsistO is an intelligent assistant that provides useful recommendations for training and on-line assistance in the power plant domain. AsistO was built specifically to demonstrate the potential of the qualitative MDP approach for solving planning problems in complex domains. The recommender system is coupled to a power plant simulator capable of partially reproducing the operation of a combined-cycle power plant (CCPP), in particular the steam generation process (HRSG) described in section 2. The simulator (figure 7) provides controls for setting up the power conditions in the gas and steam turbines (nominal load, medium load, minimum load, hot standby condition, low speed, and start-up). It includes an operation panel to configure load demands, unit trips, shutdowns, and other high-level operations in different plant subsystems. It also includes a visualization tool for tracking the behavior in time of a set of variables selected by the user, and a function for recording historical data.

Fig. 7. A screenshot of the human-computer interface of the steam generation simulator. The simulator provides controls, an operation panel, and data visualization tools

6.1. General Architecture

The AsistO recommender system is composed of a decision model base, a simulation data base, and the following subsystems: i) data management, ii) model management, iii) planning subsystem, and iv) user interface. Figure 8 shows AsistO's general architecture. The simulation data base stores the process signals generated by the simulator (outputs), and the control signals (inputs) sent by an instructor to set up a specific electric load or failure condition in the process. On the other hand, the decision model base stores the qualitative MDP model of the process and its solution in the form of a policy. That is, it contains the optimal action that will be recommended to the operator for every state of the plant subprocess considered. The policy is based on a factored representation of the plant q-states (see section 4.2), and it is represented in the form of algebraic decision diagrams (ADDs) [14].
Fig. 8. AsistO's general architecture. Given a state of the plant obtained from the simulation data base, the planning subsystem queries the decision model base for a recommendation. This recommendation is presented to the operator via the user interface

The data management subsystem is composed of a set of tools for data administration and analysis. The model management subsystem manipulates the transition and reward models, and the utility and policy functions stored in the decision model base. The transition model management system was implemented in Elvira [8] (which was also adapted to handle dynamic Bayesian networks), and the reward model management system uses Weka [22]. The management of the policy and utility models is carried out using SPUDD [14], which includes model query and printing capabilities. The planning subsystem in AsistO is also based on SPUDD [14], which implements a very efficient version of the value iteration algorithm for MDPs as its inference method. The planning subsystem first approximates the decision models using the data stored in the simulation data base. The transition and reward models are learned using, respectively, the K2 algorithm [9] available in Elvira, and the C4.5 algorithm available in Weka (J4.8) [20]. It then uses these models and its inference algorithms to obtain an optimal policy, from which the recommendations that will be given to the operator are derived. The resulting transition and reward functions, and the policy and utility functions, are then stored in the decision model base. The planning subsystem transforms the continuous plant state into the qualitative representation described in sections 4 and 5 for problem specification and policy query purposes. The user interface provides the communication with the environment. In this case, the power plant simulator is the environment, and the operator is the actor that executes the recommendations that modify the environment. The user interface provides controls for command execution, load selection, failure simulation, and recommendation display. This module, which can also be used as a supervision console, includes the controls for random exploration and system sampling for the learning purposes described in section 4.3. It also provides a graphical interface to observe how quickly the correct execution of recommendations impacts the plant operation. The main screen of the user interface is shown in figure 9. Currently AsistO is used for operator training. In a training session, the planning subsystem obtains the plant q-state from the simulation data base. Then it queries the policy function for the current q-state in the model base to obtain a recommendation. Both the current q-state and the recommendation are shown graphically to the operator through the user interface, and the operator finally decides whether or not to execute the recommended command. The sequential execution of these recommendations helps the operator take the plant to an optimal operating condition.
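As a rough illustration of this training-session loop (read plant signals, map them to a q-state, query the stored policy), the following is a minimal sketch; the function names, thresholds and policy table are our own hypothetical stand-ins, since in AsistO the q-state mapping comes from the learned Q-tree and the policy is stored as ADDs produced by SPUDD.

```python
# Minimal sketch of the recommendation loop described above. Names and thresholds are
# illustrative; AsistO reads the q-state from the Q-tree and the policy from an ADD.

from typing import Dict

def q_state_of(signals: Dict[str, float]) -> str:
    """Map continuous plant signals to a q-state label (hypothetical thresholds)."""
    if signals["Fms"] > 50.0:
        return "q1" if signals["Pd"] > 8.0 else "q2"
    return "q0"

# Hypothetical policy table: optimal action per q-state, as read from the model base.
policy: Dict[str, str] = {"q0": "open fwv", "q1": "null action", "q2": "close msv"}

def recommend(signals: Dict[str, float]) -> str:
    """One step of a training session: q-state lookup followed by a policy query."""
    q = q_state_of(signals)
    return f"state {q}: recommended command -> {policy[q]}"

print(recommend({"Fms": 55.0, "Pd": 7.2}))
```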
Fig. 9. User interface. It is the graphical link between the recommender system and the operator. It includes supervision features, problem specification utilities, a display console, and manual control capabilities

6.2. Experimental Results

We used AsistO to run two sets of experiments of different complexity. In the first set of experiments, we specified a 5-action hybrid problem with 5 variables (Fms, Ffw, Pd, g, dl). We also defined a simple binary reward function based on the safety parameters of the drum (Pd and Fms). The relationship between their values and the reward received can be seen in figure 10 (left). Central black squares denote safe states (desired operation regions), and white zones represent non-rewarded zones (indifferent regions). To learn the model and the initial abstraction, samples of the system dynamics were gathered using simulation. Black dots in figure 10 (right) represent sampled states with positive reward, red (gray) dots have no reward, and white zones were simply not explored. Figure 10 (left) shows the state partition and the policy found (arrows) by the learning system. For this simple example, although the resulting policy is not very detailed (the q-states are quite large), it drives the plant to the optimal operating condition (the black region in the middle). When analyzed by an expert operator, this control strategy is near-optimal in most of the abstract states. We then solved the same problem adding two extra variables, the positions of the valves msv and fwv, and using 9 actions (all the combinations of opening and closing the valves msv and fwv). We also redefined the reward function to maximize power generation, g, under safe conditions in the drum. Although the problem increased significantly in complexity, the policy obtained is smoother than in the 5-action version presented above. To give an idea of the computational savings, with a fine discretization (15,200 discrete states) this problem was solved in 859.2350 seconds, while our abstract representation (40 q-states) took only 14.2970 seconds. In both cases, the solutions were found using the SPUDD system [14]. In summary, the first experiment shows that the proposed approach obtains approximately optimal policies, while the second experiment demonstrates a significant reduction in solution time in comparison to a fine discretization of the state space.
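To show how such a binary safety reward can be encoded, here is a minimal sketch; the safe ranges for Pd and Fms are placeholders, since the paper only specifies that the reward is positive inside the drum's safe operating region and zero elsewhere.

```python
# Sketch of the binary safety reward of the first experiment: reward 1 inside a safe
# drum region defined over Pd and Fms, 0 elsewhere. Bounds are placeholders, not the
# values used in the paper.

SAFE_PD = (7.5, 8.5)     # hypothetical safe drum-pressure range
SAFE_FMS = (45.0, 55.0)  # hypothetical safe main-steam-flow range

def reward(pd: float, fms: float) -> float:
    in_safe_region = SAFE_PD[0] <= pd <= SAFE_PD[1] and SAFE_FMS[0] <= fms <= SAFE_FMS[1]
    return 1.0 if in_safe_region else 0.0

print(reward(8.0, 50.0), reward(6.0, 50.0))
```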
Fig. 10. Process control problem. Left: qualitative state partition in terms of the steam flow and drum pressure. For each q-state the optimal action is shown (arrows). The black region represents the desired operating state (high reward). Right: an image of the exploration trace, where black dots represent sampled states with positive reward, red (gray) dots are sampled states with no reward, and white regions are unexplored zones

7 Conclusions and Future Work

In this paper, we presented a novel and practical model-based learning approach with iterative refinement for solving continuous and hybrid Markov decision processes. In the first phase, we use an exploration strategy of the environment and a machine learning approach to induce an initial state abstraction. We then follow a refinement process that improves the initial abstraction by performing local tests on the variance of the utility values. Our approach achieves significant reductions in space and time, allowing continuous and hybrid problems to be solved efficiently. We tested our method in a power plant domain using AsistO, showing that this approach can be applied to complex domains where a simple discretization approach is not feasible or is computationally too expensive. Since AsistO is aimed both at operation assistance and at operator training, we are currently developing an extra module that explains the recommended commands generated by the planning subsystem and provides, after a bad decision, the reason why a recommendation should have been followed. We plan to extend the planning subsystem to support partially observable MDPs, and to use the AsistO architecture in other power plant applications. As future research work we would like to improve our refinement strategy to select a better segmentation of the abstract states and to consider alternative search strategies. We also plan to test our approach in other domains.

Acknowledgments

This work was supported jointly by the Instituto de Investigaciones Eléctricas, Mexico, and CONACYT Project No. 47968.

References

1. J. Baum and A. E. Nicholson. Dynamic non-uniform abstractions for approximate planning in large structured stochastic domains. In PRICAI'98: Proceedings of the 5th Pacific Rim International Conference on Artificial Intelligence, pages 587-598, Singapore, 1998.
2. R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.
3. D. P. Bertsekas. A counter-example to temporal difference learning. Neural Computation, 1994.
4. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, 1996.
5. B. Bonet and J. Pearl. Qualitative MDPs and POMDPs: An order-of-magnitude approach. In Proceedings of the 18th Conference on Uncertainty in AI, UAI-02, pages 61-68, Edmonton, Canada, 2002.
6. C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: structural assumptions and computational leverage. Journal of AI Research, 11:1-94, 1999.
7. C. Boutilier, M. Goldszmidt, and B. Sabata. Continuous value function approximation for sequential bidding policies. In Kathryn Laskey and Henri Prade, editors, Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 81-90. Morgan Kaufmann Publishers, San Francisco, California, USA, 1999.
8. Elvira Consortium. Elvira: an environment for creating and using probabilistic graphical models. Technical report, U. de Granada, Spain, 2002.
9. G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 1992.
10. T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5:142-150, 1989.
11. Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured continuous Markov decision problems. In Proceedings of the 20th Conference on Uncertainty in AI (UAI-2004), Banff, Canada, 2004.
12. C. Guestrin, M. Hauskrecht, and B. Kveton. Solving factored MDPs with continuous and discrete variables. In Twentieth Conference on Uncertainty in Artificial Intelligence (UAI 2004), Banff, Canada, 2004.
13. M. Hauskrecht and B. Kveton. Linear program approximation for factored continuous-state Markov decision processes. In Advances in Neural Information Processing Systems (NIPS 2003), pages 895-902, 2003.
14. J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the 15th Conference on Uncertainty in AI, UAI-99, pages 279-288, 1999.
15. L. Li and M. L. Littman. Lazy approximation for solving continuous finite-horizon MDPs. In AAAI-05, pages 1175-1180, Pittsburgh, PA, 2005.
16. R. Munos and A. Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In Thomas Dean, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), pages 1348-1355. Morgan Kaufmann Publishers, San Francisco, California, USA, August 1999.
17. J. Pineau, G. Gordon, and S. Thrun. Policy-contingent abstraction for robust control. In Proceedings of the 19th Conference on Uncertainty in AI, UAI-03, pages 477-484, 2003.
18. M. L. Puterman. Markov Decision Processes. Wiley, New York, 1994.
19. J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.
20. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, Calif., USA, 1993.
21. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
22. I. H. Witten. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd Ed. Morgan Kaufmann, USA, 2005.

Alberto Reyes is a researcher at the Electrical Research Institute in México (IIE) and a part-time professor at the Instituto Tecnológico y de Estudios Superiores de Monterrey (ITESM), campus Mexico City. His research interests include decision-theoretic planning, machine learning, and their applications in robotics and intelligent assistants for industry. He received a PhD in Computer Science from ITESM, campus Cuernavaca.
L. Enrique Sucar is a Senior Researcher at the National Institute for Astrophysics, Optics and Electronics (INAOE) in Puebla, Mexico. His research interests include reasoning under uncertainty in artificial intelligence, mobile robotics and computer vision. He received a B.S. in Electronics and Communications Engineering from the Monterrey Institute of Technology, in Monterrey, Mexico, an M.Sc. in Electrical Engineering from Stanford University, and a Ph.D. degree in Computer Science from Imperial College, London. He has been president of the Mexican AI Society, a member of the Advisory Committee of IJCAI, and a member of the National Research System and the Mexican Academy of Sciences.

Eduardo F. Morales is a Senior Researcher at the National Institute for Astrophysics, Optics and Electronics (INAOE) in Puebla, Mexico. His research interests include machine learning and mobile robotics. He received a B.Sc. degree in Physics Engineering from Universidad Autónoma Metropolitana, in Mexico City, an M.Sc. in Artificial Intelligence from Edinburgh University, Scotland, and a Ph.D. degree in Computer Science from The Turing Institute - Strathclyde University, in Glasgow. He is a member of the National Research System in Mexico.