Dynamic Pricing for Smart Grid with Reinforcement Learning

Byung-Gook Kim, Yu Zhang, Mihaela van der Schaar, and Jang-Won Lee
Samsung Electronics, Suwon, Korea
Department of Electrical Engineering, UCLA, Los Angeles, USA
Department of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea

[This work was supported in part by the Mid-career Researcher Program through an NRF grant funded by the MSIP, Korea (2013R1A2A2A6953), and by US National Science Foundation (NSF) grant CCF-1218136.]

Abstract—In the smart grid system, dynamic pricing can be an efficient tool for the service provider, enabling efficient and automated management of the grid. In practice, however, the lack of information about the customers' time-varying load demand and energy consumption patterns, together with the volatility of the electricity price in the wholesale market, makes the implementation of dynamic pricing highly challenging. In this paper, we study a dynamic pricing problem in the smart grid system where the service provider decides the electricity price in the retail market. In order to overcome the challenges in implementing dynamic pricing, we develop a reinforcement learning algorithm. To resolve the drawbacks of the conventional reinforcement learning algorithm, such as high computational complexity and low convergence speed, we propose an approximate state definition and adopt virtual experience. Numerical results show that the proposed reinforcement learning algorithm works effectively without a priori information about the system dynamics.

I. INTRODUCTION

In the smart grid system, thanks to the real-time information exchange through communication networks, customers can schedule the operation of their appliances according to changes in the electricity price via the automated energy management systems equipped in households, which we refer to as demand response [1]. From the customers' perspective, previous works on load scheduling focus on directly controlling the energy consumption of residential appliances. For example, in our previous work [2], we proposed two different load scheduling algorithms for a collaborative and a non-collaborative smart grid system by taking into account the customers' bidirectional energy trading capability via electric vehicles. Most of these works aim at maximizing the social welfare of the smart grid system, assuming that the pricing policies are predetermined in the electricity market by the service provider. Consequently, the service provider is regarded as a passive and indifferent entity whose role in the smart grid system is significantly limited. On the contrary, from the service provider's perspective, dynamic pricing is an attractive tool that enables efficient grid operation in terms of both efficient energy consumption and automated management. Our paper is related to this second strand of literature. Specifically, we focus on a scenario where the service provider can adaptively decide the retail electricity price based on the customers' load demand levels and the wholesale price, such that it minimizes either the customers' disutility (in the case of a benevolent service provider) or its own cost (in the case of a profit-making service provider). Although dynamic pricing does not directly control each customer's load scheduling, appropriate pricing can give considerable benefits to the smart grid system by encouraging the customers to consume energy in a more efficient way.

Recently, there have been several works on dynamic pricing for the smart grid [3]-[8]. In [3] and [4], dynamic pricing problems were studied aiming at maximizing the social welfare. Considering a smart grid system with multiple residences and a single service provider, optimal dynamic pricing schemes were proposed based on dual decomposition approaches.
The authors in [5] focused on a smart grid system with non-cooperative customers, to which the conventional optimization approach of [3] cannot be applied. To overcome the lack of cooperation among customers, a simulated annealing-based dynamic pricing algorithm was developed. In a similar context, the authors in [6] modeled a dynamic pricing problem as a Stackelberg game in which the service provider decides the retail price and each selfish customer decides the schedule of its appliances according to the price. In [7], the authors developed an incentive-based dynamic pricing scheme which allows the service provider to decide the incentive for the customers who shift their appliance usage from peak hours to off-peak hours. In [8], the authors introduced a two-timescale dynamic pricing scheme to incorporate both the customers with day-ahead scheduling and the customers with real-time scheduling. To take into account the uncertainties of energy supply and demand, the authors formulated a Markov decision process (MDP) problem and developed an online algorithm.

Despite these previous efforts, there still exist several critical challenges in implementing dynamic pricing for demand response. First, in a practical smart grid system, it is not easy for the service provider to obtain customer-side information such as the current load demand levels, the transition probabilities of those demand levels, and the customer-specific utility models, including the willingness to purchase electric energy given the load demand level and the retail price. Second, even if the service provider can obtain this information, it will suffer from various system dynamics and uncertainties. The service provider, which lies between the utility company and the customers, may not obtain perfect information about those system dynamics a priori. Finally, the service provider is required to have the ability to estimate the impact of its current pricing decision on the customers' future behavior. In fact, the current price influences not only the customers' current energy consumption but also their energy consumption over the next several hours or the next day. However, it is not easy for the service provider to calculate the optimal price considering the future influence of the current price without detailed customer-side information. Thus, most of the existing
works on dynamic pricing for the smart grid have taken myopic approaches, in which the algorithms for dynamic pricing and demand-side load scheduling are conducted within a given time period without considering the long-term performance of the smart grid system.

In order to overcome the aforementioned challenges of dynamic pricing, in this paper we use reinforcement learning to allow the service provider to learn the behaviors of the customers and the changes of the wholesale price so as to make optimal pricing decisions. We consider various stochastic dynamics of the smart grid system, including the customers' dynamic demand generation and energy consumption as well as the wholesale price changes. Based on the considered system model, we formulate an MDP problem in which the service provider decides the retail electricity price based on the observed system state transitions to minimize its expected total cost or the customers' disutility. Contrary to the previous works [3]-[8] with simplified models, in this paper we consider a more realistic system model that incorporates the customers' demand generation and load demand changes as well as the wholesale market dynamics, where the wholesale electricity price can be changed by the utility company at each time-slot. To solve the MDP problem without a priori information about the changes of the customers' load demand levels, we adopt the Q-learning algorithm, and to resolve the existing drawbacks of the conventional Q-learning algorithm, we propose the following two improvements. First, to reduce the complexity of the Q-learning algorithm, which mainly comes from the large number of customers, we propose an alternative state definition based on the observed total energy consumption. Second, to improve the learning speed, we adopt virtual experience in the Q-learning updates.

The rest of this paper is organized as follows. In Section II, the system model is presented. In Section III, we define the dynamic pricing problem and develop the reinforcement learning-based dynamic pricing algorithm. We provide numerical results in Section IV and finally conclude in Section V.

II. SYSTEM MODEL

We consider a smart grid system which consists of one service provider and a set of customers $\mathcal{I}$, as shown in Fig. 1.

[Fig. 1. Smart grid system: the utility company announces the wholesale pricing function $c^t(\cdot)$ to the service provider, which announces the retail pricing function $a^t(\cdot)$ to the customers; information and electric energy flow between the entities.]

The smart grid system operates in a time-slotted fashion, where each time-slot has an equal duration. At each time-slot $t$, the service provider buys electric energy from the utility company through a wholesale electricity market and provides it to the customers through a retail electricity market. In the retail electricity market, at each time-slot $t$, the service provider determines the retail pricing function $a^t : \mathbb{R}_+ \to \mathbb{R}_+$ and charges each customer $i$ an electricity bill $a^t(e_i^t)$, where $e_i^t$ denotes customer $i$'s energy consumption at time-slot $t$. We define the set of retail pricing functions as $\mathcal{A}$ and assume that the number of retail pricing functions, $|\mathcal{A}|$, is finite. At each time-slot, each customer generates its electricity load demand and decides the amount of its energy consumption based on its current load demand level and the retail pricing function. We assume that the customers' average demand generation rates and the wholesale pricing function at a time-slot can vary depending on the actual time of day. Moreover, this time-dependency of the customers' energy consumption may influence the utility company's decision on the wholesale pricing function. To model this time-dependency of the demand generation rate and the wholesale pricing function, we introduce a set of periods $\mathcal{H} = \{0, 1, \ldots, H-1\}$, each of
which represents an actual time in a day. We map each time-slot $t$ to one period $h \in \mathcal{H}$, denoting the period at time-slot $t$ by $h^t$. We assume that the sequence of periods $h^t, t = 0, 1, 2, \ldots$ is predetermined and repeated every day. For example, if one day consists of $H = 24$ periods (i.e., 24 hours), each time-slot $t$ is mapped to one period in $\mathcal{H} = \{0, 1, \ldots, 23\}$, and the mapping between time-slots and periods can be represented as

$$h^t = \mathrm{mod}(t, H), \quad \forall t. \qquad (1)$$

A. Model of Customers' Response

In each time-slot, each customer $i$ has an accumulated load demand¹, which is defined as the total amount of energy that it wants to consume for its appliances in that time-slot. We denote the amount of the accumulated load demand of customer $i$ at time-slot $t$ by $d_i^t \in \mathcal{D}_i$, where $\mathcal{D}_i$ is the set of customer $i$'s accumulated load demand levels. Once customer $i$ consumes energy $e_i^t$ at time-slot $t$, the corresponding amount of customer $i$'s load demand is satisfied, while the rest of the accumulated load demand, $d_i^t - e_i^t$, is not satisfied; we call it the remaining load demand. When the remaining load demand of a customer is greater than 0, it causes some degree of dissatisfaction to the customer at that time-slot. To capture this dissatisfaction from the remaining load demand, we introduce a disutility function $u_i : \mathbb{R}_+ \to \mathbb{R}_+$ for each customer $i$. We assume that $u_i(\cdot)$ is an increasing function of the remaining load demand $d_i^t - e_i^t$. Based on the disutility $u_i(d_i^t - e_i^t)$ and the electricity bill $a^t(e_i^t)$ that customer $i$ has to pay to the service provider, we define customer $i$'s cost at each time-slot $t$ as

$$\phi_i^t(d_i^t, e_i^t) = u_i(d_i^t - e_i^t) + a^t(e_i^t). \qquad (2)$$

We assume that each customer tries to minimize its cost at each time-slot by deciding the amount of its energy consumption, and we let $\hat{e}_i^t(d_i^t)$ denote customer $i$'s energy consumption decision that minimizes its cost², i.e.,

$$\hat{e}_i^t(d_i^t) = \operatorname*{arg\,min}_{0 \le e_i^t \le \min(e_i^{\max},\, d_i^t)} \phi_i^t(d_i^t, e_i^t), \qquad (3)$$

¹For convenience, "accumulated load demand" and "load demand" are used interchangeably in the rest of this paper.
²Customer $i$'s energy consumption decision is also a function of the retail pricing function $a^t$. However, for simplicity of exposition, we represent it as $\hat{e}_i^t(d_i^t)$.
where $e_i^{\max}$ is the maximum amount of energy that customer $i$ can consume at each time-slot due to physical limitations of the grid.

We assume that a portion of each customer's remaining load demand at a time-slot is carried forward to the next time-slot. We call the corresponding load demand the demand backlog and represent it as $\lambda_i (d_i^t - \hat{e}_i^t(d_i^t))$, where $0 \le \lambda_i \le 1$ is the backlog rate of load demand that determines the amount of demand backlog of customer $i$. At each time-slot $t$, each customer $i$ randomly generates its new load demand, $D_i^t(h^t)$, whose distribution is assumed to depend on the current period $h^t$. At the beginning of each time-slot $t+1$, according to the demand backlog at the previous time-slot $t$, $\lambda_i (d_i^t - \hat{e}_i^t(d_i^t))$, and the newly generated load demand, $D_i^{t+1}(h^{t+1})$, customer $i$'s accumulated load demand $d_i^{t+1}$ is updated as

$$d_i^{t+1} = \lambda_i \big( d_i^t - \hat{e}_i^t(d_i^t) \big) + D_i^{t+1}(h^{t+1}). \qquad (4)$$

Here, if $\lambda_i = 0$, no remaining demand at a time-slot is carried forward, generating no demand backlog, whereas if $\lambda_i = 1$, all the remaining demand at a time-slot is carried forward to the next time-slot. It is worth noting that the transition probability of the load demand from $d_i^t$ to $d_i^{t+1}$ depends only on the accumulated load demand $d_i^t$, the period $h^t$, and the retail pricing function $a^t$ at time-slot $t$, and we represent it as $p_{d_i}(d_i^{t+1} \mid d_i^t, h^t, a^t)$.

B. Wholesale Electricity Market

At each time-slot $t$, the service provider buys electric energy, which corresponds to the total amount of energy consumption of the customers, $\sum_{i \in \mathcal{I}} \hat{e}_i^t(d_i^t)$, from the utility company in the wholesale electricity market, as illustrated in Fig. 1. The utility company charges the service provider a wholesale electricity cost based on a wholesale pricing function $c^t : \mathbb{R}_+ \to \mathbb{R}_+$, where $c^t$ is a function of the total amount of energy consumption $\sum_{i \in \mathcal{I}} \hat{e}_i^t(d_i^t)$. We assume that $c^t$ is selected among a finite number of wholesale pricing functions in set $\mathcal{C}$, and that its transition from $c^t$ to $c^{t+1}$ depends on the current wholesale pricing function, $c^t$, and the current period, $h^t$; thus it can be represented as $p_c(c^{t+1} \mid c^t, h^t)$. We define the service provider's cost function at each time-slot $t$ as a function of the customers' load demand vector $\mathbf{d}^t = [d_i^t]_{i \in \mathcal{I}}$, the wholesale electricity pricing function $c^t$, and the retail pricing function $a^t$, i.e.,

$$\psi^t(\mathbf{d}^t, c^t, a^t) = c^t\Big(\sum_{i \in \mathcal{I}} \hat{e}_i^t(d_i^t)\Big) - \sum_{i \in \mathcal{I}} a^t\big(\hat{e}_i^t(d_i^t)\big), \qquad (5)$$

where the first term denotes the total wholesale electricity cost that the service provider has to pay to the utility company and the second term denotes the service provider's revenue from selling energy to the customers.

III. REINFORCEMENT LEARNING ALGORITHM

In this section, based on the smart grid system introduced in the previous section, we first formulate a dynamic pricing problem in the framework of an MDP. Then, by using reinforcement learning, we develop an efficient and fast dynamic pricing algorithm which does not require information about the system dynamics and uncertainties.

A. Problem Formulation

We formulate the dynamic pricing problem in the smart grid system as an MDP problem, which is defined by a set of decision maker's actions, a set of system states and their transition probabilities, and a system cost function for the decision maker. In our MDP problem, the decision maker is the service provider, whose action is choosing a retail pricing function $a^t \in \mathcal{A}$ at each time-slot $t$. We define the state of our smart grid system at time-slot $t$ as the combination of the accumulated load demand vector $\mathbf{d}^t$, the current period $h^t$, and the wholesale pricing function $c^t$, i.e.,

$$s^t = (\mathbf{d}^t, h^t, c^t) \in \mathcal{S}, \qquad (6)$$

where $\mathcal{S} = \prod_{i \in \mathcal{I}} \mathcal{D}_i \times \mathcal{H} \times \mathcal{C}$.
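To make the customer-side model concrete, the following sketch implements the best response (3) and the demand update (4), assuming the quadratic disutility $u_i(x) = \kappa x^2$ and the linear retail pricing function that Section IV later adopts; the bound e_max is an illustrative value, not one from the paper.

import numpy as np

def best_response(d, chi, kappa=0.1, e_max=2.5):
    # Eq. (3) for u(x) = kappa*x^2 and a(e) = chi*e: minimize
    # kappa*(d - e)**2 + chi*e over 0 <= e <= min(e_max, d).
    # The unconstrained minimizer is e* = d - chi/(2*kappa);
    # the constrained one is its projection onto the interval.
    # (e_max = 2.5 kWh is an assumed physical limit for illustration.)
    e_star = d - chi / (2.0 * kappa)
    return float(np.clip(e_star, 0.0, min(e_max, d)))

def demand_update(d, e, new_demand, lam=0.5):
    # Eq. (4): a fraction lam of the remaining demand (d - e) is
    # backlogged and added to the newly generated demand.
    return lam * (d - e) + new_demand

For instance, with $d = 3$, $\chi = 0.4$, and $\kappa = 0.1$, the unconstrained minimizer is $3 - 0.4/0.2 = 1$, which already lies inside the feasible interval, so the customer consumes 1 unit and backlogs $\lambda \cdot 2$ units of demand.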
Since the transition of each customer's load demand (from $d_i^t$ to $d_i^{t+1}$), that of the period (from $h^t$ to $h^{t+1}$), and that of the wholesale pricing function (from $c^t$ to $c^{t+1}$) depend only on the state $s^t$ and the action $a^t$ at time-slot $t$, the sequence of states $\{s^t, t = 0, 1, 2, \ldots\}$ follows a Markov decision process with action $a^t$. The transition probability from state $s^t = (\mathbf{d}^t, h^t, c^t)$ to state $s^{t+1} = (\mathbf{d}^{t+1}, h^{t+1}, c^{t+1})$ with given action $a^t$ can be represented as

$$p_s(s^{t+1} \mid s^t, a^t) = p_h(h^{t+1} \mid h^t)\, p_c(c^{t+1} \mid c^t, h^t) \prod_{i \in \mathcal{I}} p_{d_i}(d_i^{t+1} \mid d_i^t, h^t, a^t), \qquad (7)$$

where $p_h(h^{t+1} \mid h^t)$ denotes the transition probability of the period from $h^t$ to $h^{t+1}$. We define the system cost for the service provider at each time-slot $t$ as the weighted sum of the service provider's cost and the customers' cost at the time-slot:

$$r^t(s^t, a^t) = (1-\rho)\, \psi^t(\mathbf{d}^t, c^t, a^t) + \rho \sum_{i \in \mathcal{I}} \phi_i^t(d_i^t, \hat{e}_i^t), \qquad (8)$$

where $\rho$ denotes the weighting factor that determines the relative importance between the service provider's cost and the customers' cost. We denote the stationary policy that maps states to actions (retail pricing functions) by $\pi : \mathcal{S} \to \mathcal{A}$, i.e., $a^t = \pi(s^t)$. The objective of our dynamic pricing problem is to find an optimal policy $\pi^*$ for each state $s \in \mathcal{S}$ that minimizes the expected discounted system cost of the service provider, as in the following MDP problem (P):

$$\text{(P)}: \quad \min_{\pi : \mathcal{S} \to \mathcal{A}} \ \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t\, r^t\big(s^t, \pi(s^t)\big) \Big], \qquad (9)$$

where $0 \le \gamma < 1$ is the discount factor, which represents the relative importance of the future system cost compared with the present system cost. The optimal stationary policy $\pi^*$ can be well defined by using the optimal action-value function $Q^* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which satisfies the following Bellman optimality equation:

$$Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, V^*(s'), \qquad (10)$$

where $V^*(s')$ is the optimal state-value function [9], which is defined as

$$V^*(s') = \min_{a' \in \mathcal{A}} Q^*(s', a'), \quad \forall s' \in \mathcal{S}. \qquad (11)$$
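If the transition probabilities in (7) were known, (10)-(11) could be solved offline; a minimal value-iteration sketch for finite $\mathcal{S}$ and $\mathcal{A}$ is given below. The point of this paper is precisely that these probabilities are unknown, which is why Q-learning is adopted next.

import numpy as np

def q_value_iteration(p, r, gamma=0.95, tol=1e-6):
    # Solve the Bellman optimality equations (10)-(11) by value
    # iteration, assuming p[s, a, s'] (transition probabilities) and
    # r[s, a] (expected system cost) are given as arrays. Returns Q*;
    # the optimal policy of eq. (12) is then Q.argmin(axis=1).
    n_s, n_a = r.shape
    Q = np.zeros((n_s, n_a))
    while True:
        V = Q.min(axis=1)              # eq. (11): V*(s') = min_a' Q*(s', a')
        Q_new = r + gamma * p.dot(V)   # eq. (10), batched over all (s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new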
Since $Q^*(s, a)$ is the expected discounted system cost with action $a$ in state $s$, we can obtain the optimal stationary policy as

$$\pi^*(s) = \operatorname*{arg\,min}_{a \in \mathcal{A}} Q^*(s, a). \qquad (12)$$

In this paper, we use the well-known Q-learning algorithm to solve our MDP problem (P) without information about the state transition probabilities. We refer the readers to [10] for more detail on the Q-learning algorithm. In the following subsections, in order to resolve the critical issues that make it difficult to apply the conventional Q-learning algorithm to our smart grid system, we propose an alternative state definition based on the observed total energy consumption and adopt virtual experience in the Q-learning updates.

B. Energy Consumption-Based Approximate State (EAS)

When the size of the state space is large, the Q-learning algorithm requires not only a large memory space to store the state-action function $Q(s, a)$ but also a long time to converge. Moreover, in a practical smart grid system, it is difficult for the service provider to acquire or use the information about the customers' current load demands due to privacy concerns. In order to resolve these difficulties, in this section we propose an alternative definition of the system state, which is based on the observed total energy consumption, $\sum_{i \in \mathcal{I}} \hat{e}_i^{t-1}$, and the previously chosen action $a^{t-1}$. For notational convenience, we omit $d_i^t$ in $\hat{e}_i^t(d_i^t)$ in the rest of this section. The main idea of this alternative state definition comes from the fact that, since each customer's disutility function $u_i$ is a decreasing function of $\hat{e}_i^t$, by the load demand update process in (4), the retail pricing function and each customer's energy consumption at time-slot $t-1$ in a tuple $(\hat{e}_i^{t-1}, a^{t-1})$ characterize the past accumulated load demand $d_i^{t-1}$. Hence, if the service provider knows the new load demand $D_i^t(h^t)$ at time-slot $t$, it can regard a different tuple $(\hat{e}_i^{t-1}, a^{t-1}, D_i^t(h^t))$ as a different actual accumulated load demand of customer $i$ at time-slot $t$. Similarly, once a tuple $(\sum_{i \in \mathcal{I}} \hat{e}_i^{t-1}, a^{t-1}, \sum_{i \in \mathcal{I}} D_i^t(h^t))$ is observed by the service provider, it approximately reflects the customers' overall load demands at time-slot $t$. It is worth noting that since $D_i^t(h^t)$ is an independent random variable for each customer $i$, by the law of large numbers, the average of the sum of new load demands, $\sum_{i \in \mathcal{I}} D_i^t(h^t)/|\mathcal{I}|$, approaches its expected value as the number of customers grows. This implies that in a practical smart grid system with a large number of customers, the tuple $(\sum_{i \in \mathcal{I}} \hat{e}_i^{t-1}, a^{t-1})$ provides enough information for the service provider to infer the customers' overall load demand level at time-slot $t$. Hence, instead of the original state $s^t$ in (6), we can use a new state definition based on the observed energy consumption, with which the service provider does not need to know either each customer's load demand or its disutility function. To reduce the number of system states, we discretize the observed energy consumption $\sum_{i \in \mathcal{I}} \hat{e}_i^{t-1}$ into a finite number of energy levels in $\mathcal{E}$, the set of observed energy consumption levels, by using a quantization operation $q_E(\cdot)$.³

³In our system model, the method of discretization of the observed energy consumption is not limited to a specific method. A simple example is to use equally divided levels between the maximum and minimum amounts of the energy consumption.
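Following the equally-divided-levels example in footnote 3, $q_E(\cdot)$ can be implemented as a uniform quantizer; the bounds and the number of levels below are illustrative assumptions, not values from the paper.

import numpy as np

def q_E(total_consumption, e_min=0.0, e_max=50.0, n_levels=10):
    # Uniform quantizer for the observed total energy consumption,
    # per footnote 3: equally divided levels between assumed minimum
    # and maximum consumption. Returns the index of the energy level
    # in E = {0, ..., n_levels - 1}.
    edges = np.linspace(e_min, e_max, n_levels + 1)
    level = np.searchsorted(edges, total_consumption, side='right') - 1
    return int(np.clip(level, 0, n_levels - 1))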
We then refer to the tuple $(q_E(\sum_{i \in \mathcal{I}} \hat{e}_i^{t-1}), a^{t-1})$ as the approximate demand at time-slot $t$ and represent it as

$$d_{\mathrm{app}}^t = \Big( q_E\Big( \sum_{i \in \mathcal{I}} \hat{e}_i^{t-1} \Big),\, a^{t-1} \Big). \qquad (13)$$

Based on the approximate demand, we now define the energy consumption-based approximate state (EAS) of the smart grid system as

$$x^t = (d_{\mathrm{app}}^t, h^t, c^t) \in \mathcal{X}, \qquad (14)$$

where $\mathcal{X} = \mathcal{E} \times \mathcal{A} \times \mathcal{H} \times \mathcal{C}$ denotes the set of EASs. Note that the EAS drastically reduces the number of states from $|\mathcal{S}| = \prod_{i \in \mathcal{I}} |\mathcal{D}_i| \times |\mathcal{H}| \times |\mathcal{C}|$ to $|\mathcal{X}| = |\mathcal{E}| \times |\mathcal{A}| \times |\mathcal{H}| \times |\mathcal{C}|$, while allowing the service provider to easily infer the customers' current load demand level without direct signaling from the customers. Now, we can simply substitute the original state definition $s^t$ by the EAS $x^t$ in the Q-learning algorithm.

C. Accelerated Learning Using Virtual Experience

Although the EAS $x^t$ significantly reduces the state space, the learning speed of the Q-learning algorithm might still be seriously limited by its inherent structure, in which only one state-action pair is updated at each time-slot. In this subsection, in order to improve the speed of the Q-learning algorithm, we adopt virtual experience, which was introduced in [11]. The Q-learning algorithm with virtual experience enables the service provider to update multiple state-action pairs at each time-slot by exploiting a priori known partial information about the state transition probability $p_x(x^{t+1} \mid x^t, a^t)$. In this subsection, we consider the case where the service provider knows the transition probability of the wholesale pricing function, $p_c(c^{t+1} \mid c^t, h^t)$, a priori. Note that this is not a too restrictive assumption, because the service provider can gather sufficient data on the transitions of the wholesale pricing function while participating in the wholesale electricity market.

We first define the experience tuple (ET) observed by the service provider at time-slot $t+1$ as $\sigma^{t+1} = (x^t, a^t, r^t, x^{t+1})$, where $r^t$ is the observed system cost. Then, given the actual ET $\sigma^{t+1}$, we define a set of virtual experience tuples (virtual ETs) $\theta(\sigma^{t+1})$, which are statistically equivalent⁴ to the actual ET $\sigma^{t+1}$, i.e.,

$$\theta(\sigma^{t+1}) = \Big\{ \tilde{\sigma}^{t+1} \,\Big|\, \tilde{d}_{\mathrm{app}}^t = d_{\mathrm{app}}^t,\ \tilde{h}^t = h^t,\ \tilde{a}^t = a^t,\ \tilde{r}^t = r(\tilde{c}^t),\ p_c(\tilde{c}^{t+1} \mid \tilde{c}^t, \tilde{h}^t) = p_c(c^{t+1} \mid c^t, h^t) \Big\}, \qquad (15)$$

where $r(c)$ represents the system cost that is virtually calculated by using an arbitrary wholesale pricing function $c$.

⁴An ET $\tilde{\sigma}^{t+1} = (\tilde{x}^t, \tilde{a}^t, \tilde{r}^t, \tilde{x}^{t+1})$ is said to be statistically equivalent to ET $\sigma^{t+1} = (x^t, a^t, r^t, x^{t+1})$ if $p_x(\tilde{x}^{t+1} \mid \tilde{x}^t, \tilde{a}^t) = p_x(x^{t+1} \mid x^t, a^t)$ and the system cost $\tilde{r}^t$ can be calculated by using $\sigma^{t+1}$.
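Generating $\theta(\sigma^{t+1})$ thus reduces to enumerating wholesale-price transitions with matching statistics and updating one Q-entry per virtual tuple. A minimal sketch follows, assuming $p_c$ is stored as one $|\mathcal{C}| \times |\mathcal{C}|$ matrix per period and the Q-table is indexed as Q[d_app, h, c, a]; this indexing and the callback cost_of are our illustrative choices, not the paper's.

import numpy as np

def virtual_tuples(c, c_next, h, p_c, tol=1e-9):
    # Eq. (15): all wholesale-price pairs (c~, c~') whose transition
    # probability in period h equals that of the observed pair
    # (c, c_next); the observed pair itself is always included.
    q = p_c[h][c, c_next]
    n = p_c[h].shape[0]
    return [(ct, cn) for ct in range(n) for cn in range(n)
            if abs(p_c[h][ct, cn] - q) < tol]

def q_update_virtual(Q, d_app, d_app_next, h, a, c, c_next,
                     cost_of, p_c, alpha, gamma=0.95):
    # Lines 7-10 of Algorithm 1: one TD update per virtual tuple.
    # cost_of(c~) stands in for eq. (8) recomputed with the observed
    # total consumption but wholesale pricing function c~. The period
    # transition is deterministic, as in eq. (1).
    h_next = (h + 1) % len(p_c)
    for ct, cn in virtual_tuples(c, c_next, h, p_c):
        target = cost_of(ct) + gamma * Q[d_app_next, h_next, cn, :].min()
        Q[d_app, h, ct, a] += alpha * (target - Q[d_app, h, ct, a])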
In our smart grid system, if $d_{\mathrm{app}}^t$ and $h^t$ are fixed, the system cost $\tilde{r}^t$ can easily be calculated for an arbitrary wholesale pricing function $\tilde{c}^t \in \mathcal{C}$ by applying the same energy consumption $\sum_{i \in \mathcal{I}} \hat{e}_i^t$ to (8). Moreover, since the transition probability of the wholesale pricing function $p_c(c^{t+1} \mid c^t, h^t)$ is independent of the approximate demand $d_{\mathrm{app}}^t$ and the retail electricity pricing function $a^t$, we can easily generate the set of virtual ETs $\theta(\sigma^{t+1})$ in (15) from the observed actual ET $\sigma^{t+1}$. While the observed ET $\sigma^{t+1}$ is used to update only one state-action function $Q(x^t, a^t)$ in conventional Q-learning, by using the virtual ETs, the Q-learning algorithm can update multiple state-action pairs at each time-slot. The Q-learning algorithm with virtual experience is outlined in Algorithm 1.

Algorithm 1 Q-Learning Algorithm with Virtual Experience
1: Initialize $Q$ arbitrarily, $t = 0$
2: for each time-slot $t$
3:   Choose $a^t$ according to policy $\pi(x^t)$
4:   Take action $a^t$, observe system cost $r(x^t, a^t)$ and next state $x^{t+1}$
5:   Obtain experience tuple $\sigma^{t+1} = (x^t, a^t, r^t, x^{t+1})$
6:   Generate the set of virtual experience tuples $\theta(\sigma^{t+1})$
7:   for each virtual experience tuple $\tilde{\sigma}^{t+1} \in \theta(\sigma^{t+1})$
8:     $v = r(\tilde{x}^t, \tilde{a}^t) + \gamma \min_{a' \in \mathcal{A}} Q(\tilde{x}^{t+1}, a') - Q(\tilde{x}^t, \tilde{a}^t)$
9:     $Q(\tilde{x}^t, \tilde{a}^t) \leftarrow Q(\tilde{x}^t, \tilde{a}^t) + \alpha^t v$
10:  end
11: end

Lines 3-5 describe the operation of conventional Q-learning, where the service provider obtains the experience tuple $\sigma^{t+1}$. Then, in line 6, based on $\sigma^{t+1}$, the set of its virtual experiences is generated. In lines 7-10, the action-value function $Q(\tilde{x}^t, \tilde{a}^t)$ is updated for all virtual experiences in $\theta(\sigma^{t+1})$. The complexity of the proposed reinforcement learning algorithms is summarized in Table I.

TABLE I
COMPLEXITY COMPARISON OF THREE DIFFERENT ALGORITHMS.

  Algorithm                                   | Update complexity per iteration       | Memory complexity
  Q-learning with original state              | $O(|\mathcal{A}|)$                    | $O(\prod_{i \in \mathcal{I}} |\mathcal{D}_i| \cdot |\mathcal{H}| \cdot |\mathcal{C}| \cdot |\mathcal{A}|)$
  Q-learning with EAS                         | $O(|\mathcal{A}|)$                    | $O(|\mathcal{E}| \cdot |\mathcal{H}| \cdot |\mathcal{C}| \cdot |\mathcal{A}|^2)$
  Q-learning with EAS and virtual experience  | $O(|\theta(\hat{\sigma})| \cdot |\mathcal{A}|)$ | $O(|\mathcal{E}| \cdot |\mathcal{H}| \cdot |\mathcal{C}| \cdot |\mathcal{A}|^2)$

Although the Q-learning algorithm with virtual experience has a higher update complexity than the Q-learning algorithm without virtual experience, as we will show in Section IV, it significantly reduces the number of time-slots needed to converge, which is regarded as a more important aspect than the computational complexity at each time-slot in reinforcement learning algorithms.

IV. NUMERICAL RESULTS

In this section, we provide numerical results to evaluate the performance of our dynamic pricing algorithm. One day consists of 24 time-slots, each of which lasts for one hour. Hence, the set of periods is given as $\mathcal{H} = \{0, 1, \ldots, 23\}$ and the mapping between time-slots and periods is given as $h^t = \mathrm{mod}(t, 24)$. We consider a smart grid system with 20 customers. The newly generated load demand of customer $i$, $D_i^t(h^t)$, follows a Poisson distribution with expected value $\omega_{i,h^t}$, which is proportional to the hourly average load shapes of residential electricity services in California [12], as shown in Fig. 2.

[Fig. 2. Load demand profile: average hourly load (kWh) over the time-slots of a day.]

All customers have the same backlog rate, i.e., $\lambda_i = \lambda, \forall i \in \mathcal{I}$. Each customer $i$'s disutility function $u_i(d_i^t - e_i^t)$ is given as

$$u_i(d_i^t - e_i^t) = \kappa_i (d_i^t - e_i^t)^2, \qquad (16)$$

where $\kappa_i$ is a constant that represents customer $i$'s disutility sensitivity to its remaining demand. Here, we let $\kappa_i = \kappa = 0.1, \forall i \in \mathcal{I}$. We model the wholesale pricing function $c^t$ as a quadratic function of the total energy consumption $\sum_{i \in \mathcal{I}} \hat{e}_i^t$, as in [3], [4], and [6]:

$$c^t\Big(\sum_{i \in \mathcal{I}} \hat{e}_i^t\Big) = \mu^t \Big(\sum_{i \in \mathcal{I}} \hat{e}_i^t\Big)^2 + \nu_{h^t}^t \sum_{i \in \mathcal{I}} \hat{e}_i^t. \qquad (17)$$

We set $\mu^t = 0.2, \forall t$, and let $\nu_{h^t}^t$ be a random variable whose expected value, $\bar{v}_{h^t}$, changes according to the corresponding period $h^t$ based on the hourly average load shape in Fig. 2. For a given period $h^t$, $\nu_{h^t}^t$ is chosen uniformly at random among the values in $\{0.25\bar{v}_{h^t}, 0.5\bar{v}_{h^t}, \ldots, 1.75\bar{v}_{h^t}\}$. The discount factor $\gamma$ in problem (P) is fixed to 0.95.
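The simulation environment described above can be sketched in a few lines. In the sketch below, the hourly shape omega is a synthetic placeholder for the California residential load shapes of [12], and, for simplicity, all 20 customers share the same hourly mean, whereas the paper uses per-customer rates $\omega_{i,h^t}$.

import numpy as np

rng = np.random.default_rng(0)

H = 24    # periods (hours) per day
N = 20    # number of customers

# Placeholder hourly load shape; the paper derives it from [12].
omega = 1.0 + 0.8 * np.sin(np.pi * np.arange(H) / H)

def new_demands(h):
    # D_i^t(h^t): i.i.d. Poisson demands across customers with a
    # period-dependent mean, as in the setup above.
    return rng.poisson(omega[h], size=N)

def wholesale_cost(total_e, h, mu=0.2):
    # Wholesale pricing function of eq. (17): quadratic in total
    # consumption, with nu drawn uniformly from
    # {0.25, 0.5, ..., 1.75} times the period mean v_bar[h]
    # (here approximated by the same placeholder shape).
    nu = rng.choice(np.arange(0.25, 2.0, 0.25)) * omega[h]
    return mu * total_e**2 + nu * total_e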
The retail pricing function $a^t$ is a linear function of the energy consumption $\hat{e}_i^t$, i.e.,

$$a^t(\hat{e}_i^t) = \chi^t \hat{e}_i^t, \qquad (18)$$

where the coefficient $\chi^t$ can be chosen from the set $\{0.2, 0.4, \ldots, 1.0\}$, each element of which is directly mapped to one retail pricing function in $\mathcal{A}$.

We first evaluate the performance of the dynamic pricing algorithm by comparing it with that of the myopic optimization algorithm. In the myopic optimization algorithm, the service provider chooses the action with the lowest expected instantaneous system cost, by updating the state-action function $Q(s, a)$ similarly to the Q-learning update but with a discount factor $\gamma = 0$. This implies that the myopic optimization algorithm focuses only on the immediate system cost, without considering the impact of the current action on the future system cost.
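Both schemes can be driven by the same tabular update; a minimal sketch, assuming an epsilon-greedy exploration rule (the paper does not specify its exploration policy) over the action set of eq. (18).

import numpy as np

rng = np.random.default_rng(1)
chi_grid = np.arange(0.2, 1.01, 0.2)   # retail price coefficients, eq. (18)

def choose_action(Q, x, eps=0.1):
    # Epsilon-greedy selection over A; since the objective is a cost,
    # the greedy choice is the argmin, as in eq. (12).
    if rng.random() < eps:
        return int(rng.integers(len(chi_grid)))
    return int(np.argmin(Q[x]))

def q_update(Q, x, a, cost, x_next, alpha, gamma):
    # Generic tabular update: gamma = 0.95 gives the proposed
    # algorithm, gamma = 0 the myopic baseline that ignores the
    # future system cost.
    Q[x][a] += alpha * (cost + gamma * Q[x_next].min() - Q[x][a])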
In Fig. 3, we show the average system costs of the two dynamic pricing algorithms while changing the backlog rate $\lambda$ from 0 to 1. We set $\rho = 0.5$, which corresponds to the case where the service provider aims at minimizing the sum of the total disutility and the wholesale cost.

[Fig. 3. Performance comparison of our reinforcement learning algorithm and the myopic optimization algorithm, varying $\lambda$ (average system cost versus $\lambda$).]

We can observe that the average system costs increase with $\lambda$ in both dynamic pricing algorithms, because with a higher backlog rate, the accumulated load demand causes a higher disutility. We also observe that the performance gap between the two algorithms increases as $\lambda$ increases. With a low backlog rate, the remaining backlog is not carried forward to the next time-slot. In this case, the solution of our reinforcement learning algorithm is the same as that of the myopic optimization algorithm. On the contrary, with a higher backlog rate, the remaining backlog is carried forward to the next time-slot. Hence, the service provider's pricing decision at a time-slot influences the accumulated load demand in the future, and thus its future system cost. Due to this difference, especially when $\lambda$ is large, our algorithm achieves better performance than the myopic optimization algorithm, which considers only the current system cost.

To study the impact of the weighting factor $\rho$, in Fig. 4 we show the cost of the customers, the cost of the service provider, and the average retail price while varying $\rho$ from 0 to 1. We set $\lambda = 0.5$.

[Fig. 4. Impact of the weighting factor $\rho$ on the performance of the customers and the service provider: average cost of customers, average cost of service provider ($/kWh), and average retail price versus $\rho$.]

We can observe that as $\rho$ increases, the service provider reduces the average retail price, the cost of the customers decreases, and the cost of the service provider increases. For example, in the case with $\rho = 0$, the service provider aims at minimizing its own cost. Hence, the service provider does not consider the customers' disutility and chooses relatively high prices to reduce the wholesale cost, which contributes most of its own cost. In the case with $\rho = 1$, the service provider aims at minimizing the customers' cost. Hence, the service provider chooses relatively low prices to provide electric energy to the customers at as low a retail price as possible.

In Fig. 5, we compare the learning speed of our reinforcement learning algorithm with virtual experience to that of the conventional Q-learning algorithm without virtual experience. We set $\lambda = 0.5$ and $\rho = 0.5$.

[Fig. 5. Impact of virtual experience on the learning speed of the Q-learning algorithm: average system cost over time-slots, with and without virtual experience.]

We can observe that our algorithm with virtual experience achieves a near-optimal average system cost after about 3,000 time-slots, while the conventional Q-learning algorithm shows a much slower learning speed. This means that even if the stochastic characteristics of the system dynamics vary in time, the proposed reinforcement learning algorithm can quickly adapt to the time-varying environment by exploiting the virtual experience update.

V. CONCLUSION

In this paper, we studied a dynamic pricing problem for the smart grid system in which the service provider can adaptively decide the electricity price according to the customers' load demand levels and the wholesale price. We developed a reinforcement learning-based dynamic pricing algorithm that enables efficient dynamic pricing without requiring perfect information about the system dynamics a priori. To resolve the existing drawbacks of the conventional reinforcement learning algorithm, we proposed two improvements: an energy consumption-based approximate state definition and the adoption of the virtual experience update in the conventional Q-learning algorithm. Numerical results show that the reinforcement learning-based dynamic pricing achieves a higher long-term performance compared with the myopic optimization approach, especially when the customers have a high demand backlog rate.
The results also show that our algorithm achieves an improved learning speed due to the alternative state definition and virtual experience, implying that our dynamic pricing algorithm can be applied to practical smart grid systems.

REFERENCES

[1] M. H. Albadi and E. El-Saadany, "Demand response in electricity markets: An overview," in IEEE Power Engineering Society General Meeting, 2007.
[2] B.-G. Kim, S. Ren, M. van der Schaar, and J.-W. Lee, "Bidirectional energy trading and residential load scheduling with electric vehicles in the smart grid," IEEE Journal on Selected Areas in Communications, vol. 31, no. 7, pp. 1219-1234, 2013.
[3] P. Samadi, A.-H. Mohsenian-Rad, R. Schober, V. Wong, and J. Jatskevich, "Optimal real-time pricing algorithm based on utility maximization for smart grid," in IEEE SmartGridComm, 2010.
[4] P. Tarasak, "Optimal real-time pricing under load uncertainty based on utility maximization for smart grid," in IEEE SmartGridComm, 2011.
[5] L. P. Qian, Y. Zhang, J. Huang, and Y. Wu, "Demand response management via real-time electricity price control in smart grids," IEEE Journal on Selected Areas in Communications, vol. 31, no. 7, pp. 1268-1280, 2013.
[6] C. Chen, S. Kishore, and L. Snyder, "An innovative RTP-based residential power scheduling scheme for smart grids," in IEEE ICASSP, 2011.
[7] C. Joe-Wong, S. Sen, S. Ha, and M. Chiang, "Optimized day-ahead pricing for smart grids with device-specific scheduling flexibility," IEEE Journal on Selected Areas in Communications, vol. 30, no. 6, pp. 1075-1085, 2012.
[8] M. He, S. Murugesan, and J. Zhang, "Multiple timescale dispatch and scheduling for stochastic reliability in smart grids with wind generation integration," in IEEE INFOCOM, 2011, pp. 461-465.
[9] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, May 1996.
[10] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Cambridge University, 1989.
[11] N. Mastronarde and M. van der Schaar, "Joint physical-layer and system-level power management for delay-sensitive wireless communications," IEEE Transactions on Mobile Computing, vol. 12, no. 4, pp. 694-709, 2013.
[12] Dynamic Load Profiles in California. Pacific Gas & Electric. [Online]. Available: http://www.pge.com/tariffs/energy_use_prices.shtml