Online Multi-Task Learning for Policy Gradient Methods


Haitham Bou Ammar, Eric Eaton
University of Pennsylvania, Computer and Information Science Department, Philadelphia, PA USA

Paul Ruvolo
Olin College of Engineering, Needham, MA USA

Matthew E. Taylor (TAYLORM@EECS.WSU.EDU)
Washington State University, School of Electrical Engineering and Computer Science, Pullman, WA USA

Abstract

Policy gradient algorithms have shown considerable recent success in solving high-dimensional sequential decision making tasks, particularly in robotics. However, these methods often require extensive experience in a domain to achieve high performance. To make agents more sample-efficient, we developed a multi-task policy gradient method to learn decision making tasks consecutively, transferring knowledge between tasks to accelerate learning. Our approach provides robust theoretical guarantees, and we show empirically that it dramatically accelerates learning on a variety of dynamical systems, including an application to quadrotor control.

1. Introduction

Sequential decision making (SDM) is an essential component of autonomous systems. Although significant progress has been made on developing algorithms for learning isolated SDM tasks, these algorithms often require a large amount of experience before achieving acceptable performance. This is particularly true for the high-dimensional SDM tasks that arise in robot control problems. The cost of this experience can be prohibitively expensive in terms of both time and fatigue of the robot's components, especially in scenarios where an agent will face multiple tasks and must be able to quickly acquire control policies for each new task. Another failure mode of conventional methods is that when the production environment differs significantly from the training environment, previously learned policies may no longer be correct.

(Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the authors.)

When data is in limited supply, learning task models jointly through multi-task learning (MTL) rather than independently can significantly improve model performance (Thrun & O'Sullivan, 1996; Zhang et al., 2008; Rai & Daumé, 2010; Kumar & Daumé, 2012). However, MTL's performance gain comes at a high computational cost when learning new tasks or when updating previously learned models. Recent work (Ruvolo & Eaton, 2013) in the supervised setting has shown that nearly identical performance to batch MTL can be achieved in online learning with large computational speedups. Building upon this work, we introduce an online MTL approach to learn a sequence of SDM tasks with low computational overhead. Specifically, we develop an online MTL formulation of policy gradient reinforcement learning that enables an autonomous agent to accumulate knowledge over its lifetime and efficiently share this knowledge between SDM tasks to accelerate learning. We call this approach the Policy Gradient Efficient Lifelong Learning Algorithm (PG-ELLA), the first (to our knowledge) online MTL policy gradient method. Instead of learning a control policy for an SDM task from scratch, as in standard policy gradient methods, our approach rapidly learns a high-performance control policy based on the agent's previously learned knowledge. Knowledge is shared between SDM tasks via a latent basis that captures reusable components of the learned policies. The latent basis is then updated with newly acquired knowledge, enabling (a) accelerated learning of new task models and (b) improvement in the performance of existing models without retraining on their respective tasks. The latter capability
is especially important in ensuring that the agent can accumulate knowledge over its lifetime across numerous tasks without exhibiting negative transfer. We show that this process is highly efficient with robust theoretical guarantees. We evaluate PG-ELLA on four dynamical systems, including an application to quadrotor control, and show that PG-ELLA outperforms standard policy gradients in both initial and final performance.

2. Related Work in Multi-Task RL

Due to its empirical success, there is a growing body of work on transfer learning approaches to reinforcement learning (RL) (Taylor & Stone, 2009). By contrast, relatively few methods for multi-task RL have been proposed. One class of algorithms for multi-task RL uses nonparametric Bayesian models to share knowledge between tasks. For instance, Wilson et al. (2007) developed a hierarchical Bayesian approach that models the distribution over Markov decision processes (MDPs) and uses this distribution as a prior for learning each new task, enabling it to learn tasks consecutively. In contrast to our work, Wilson et al. focused on environments with discrete states and actions. Additionally, their method requires the ability to compute an optimal policy given an MDP; this process can be expensive for even moderately large discrete environments, and is computationally intractable for the types of continuous, high-dimensional control problems considered here. Another example is by Li et al. (2009), who developed a model-free multi-task RL method for partially observable environments. Unlike our problem setting, their method focuses on off-policy batch MTL. Finally, Lazaric & Ghavamzadeh (2010) exploit shared structure in the value functions of related MDPs; however, their approach is designed for on-policy multi-task policy evaluation rather than for computing optimal policies.

A second approach to multi-task RL is based on Policy Reuse (Fernández & Veloso, 2013), in which policies from previously learned tasks are probabilistically reused to bias the learning of new tasks. One drawback of Policy Reuse is that it requires that tasks share common states, actions, and transition functions (though it allows different reward functions), while our approach only requires that tasks share a common state and action space. This restriction precludes the application of Policy Reuse to the scenarios considered in Section 7, where the systems have related but not identical transition functions. Also, in contrast to PG-ELLA, Policy Reuse does not support reverse transfer, where subsequent learning improves previously learned policies. Perhaps the approach most similar to ours is by Deisenroth et al. (2014), which uses policy gradients to learn a single controller that is optimal on average over all training tasks. By appropriately parameterizing the policy, the controller can be customized to particular tasks. However, this method requires that tasks differ only in their reward function, and thus is inapplicable to our experimental scenarios.

3. Problem Framework

We first describe our framework for policy gradient RL and lifelong learning. The next section uses this framework to present our approach to online MTL for policy gradients.

3.1. Policy Gradient Reinforcement Learning

We frame each SDM task as an RL problem, in which an agent must sequentially select actions to maximize its expected return. Such problems are typically formalized as a Markov decision process (MDP) $\langle \mathcal{X}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is the (potentially infinite) set of states, $\mathcal{A} \subseteq \mathbb{R}^m$ is the set of possible actions, $P : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0,1]$ is a state transition probability function describing the system's dynamics, $R : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function measuring the agent's performance, and $\gamma \in [0,1)$ specifies the degree to which rewards are discounted over time. At each time step $h$, the agent is in state $x_h \in \mathcal{X}$ and must choose an action $a_h \in \mathcal{A}$, transitioning it to a new state $x_{h+1} \sim p(x_{h+1} \mid x_h, a_h)$ as given by $P$ and yielding reward $r_{h+1} = R(x_h, a_h)$. A policy $\pi : \mathcal{X} \times \mathcal{A} \to [0,1]$ is defined as a probability distribution over state-action pairs,
where $\pi(a \mid x)$ represents the probability of selecting action $a$ in state $x$. The goal of an RL agent is to find an optimal policy $\pi^*$ that maximizes the expected return. The sequence of state-action pairs forms a trajectory $\tau = [x_{0:H}, a_{0:H}]$ over a (possibly infinite) horizon $H$.

Policy gradient methods (Sutton et al., 1999; Peters & Schaal, 2008; Peters & Bagnell, 2010) have shown success in solving high-dimensional problems, such as robotic control (Peters & Schaal, 2007). These methods represent the policy $\pi_\theta(a \mid x)$ using a vector $\theta \in \mathbb{R}^d$ of control parameters. The goal is to determine the optimal parameters $\theta^*$ that maximize the expected average return:

\[ \mathcal{J}(\theta) = \int_{\mathbb{T}} p_\theta(\tau)\, \mathcal{R}(\tau)\, d\tau , \tag{1} \]

where $\mathbb{T}$ is the set of all possible trajectories. The trajectory distribution $p_\theta(\tau)$ and the average per-time-step return $\mathcal{R}(\tau)$ are defined as

\[ p_\theta(\tau) = P_0(x_0) \prod_{h=0}^{H-1} p(x_{h+1} \mid x_h, a_h)\, \pi_\theta(a_h \mid x_h) , \qquad \mathcal{R}(\tau) = \frac{1}{H} \sum_{h=0}^{H-1} r_{h+1} , \]

with an initial state distribution $P_0 : \mathcal{X} \to [0,1]$. Most policy gradient algorithms, such as episodic REINFORCE (Williams, 1992), PoWER (Kober & Peters, 2011), and Natural Actor Critic (Peters & Schaal, 2008), employ supervised function approximators to learn the control parameters $\theta$ by maximizing a lower bound on the expected return $\mathcal{J}(\theta)$ (Eq. 1). To achieve this, one generates trajectories using the current policy $\pi_\theta$, and then compares the result with a new policy parameterized by $\tilde{\theta}$.
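To make these quantities concrete, the following minimal sketch (not from the paper) samples trajectories and forms a Monte Carlo estimate of the expected average return J(theta). It assumes a scalar-action linear-Gaussian policy of the form used later in Section 5, and hypothetical env_reset and env_step callables standing in for P_0, P, and R.

```python
import numpy as np

def sample_trajectory(theta, env_reset, env_step, H, sigma=0.1, rng=None):
    """Roll out one trajectory under the linear-Gaussian policy a_h ~ N(theta^T x_h, sigma^2)."""
    rng = rng or np.random.default_rng()
    x = env_reset()                                           # x_0 drawn from the initial state distribution P_0
    states, actions, rewards = [], [], []
    for _ in range(H):
        a = float(theta @ x) + sigma * rng.standard_normal()  # stochastic linear controller
        x_next, r = env_step(x, a)                            # dynamics P and reward R
        states.append(x)
        actions.append(a)
        rewards.append(r)
        x = x_next
    return np.array(states), np.array(actions), np.array(rewards)

def estimate_return(theta, env_reset, env_step, H, n_traj=50, sigma=0.1):
    """Monte Carlo estimate of J(theta), using the average per-time-step return R(tau)."""
    total = 0.0
    for _ in range(n_traj):
        _, _, rewards = sample_trajectory(theta, env_reset, env_step, H, sigma)
        total += rewards.mean()
    return total / n_traj
```

For multi-dimensional actions, theta becomes a matrix and the dot product a matrix-vector product; the structure of the estimate is unchanged.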

As described by Kober & Peters (2011), the lower bound on the expected return can be attained using Jensen's inequality and the concavity of the logarithm:

\[ \log \mathcal{J}(\tilde{\theta}) = \log \int_{\mathbb{T}} p_{\tilde{\theta}}(\tau)\, \mathcal{R}(\tau)\, d\tau = \log \int_{\mathbb{T}} \frac{p_\theta(\tau)}{p_\theta(\tau)}\, p_{\tilde{\theta}}(\tau)\, \mathcal{R}(\tau)\, d\tau \geq \int_{\mathbb{T}} p_\theta(\tau)\, \mathcal{R}(\tau) \log \frac{p_{\tilde{\theta}}(\tau)}{p_\theta(\tau)}\, d\tau + \text{constant} \propto - D_{\mathrm{KL}}\big( p_\theta(\tau)\, \mathcal{R}(\tau) \,\|\, p_{\tilde{\theta}}(\tau) \big) = \mathcal{J}_{L,\theta}(\tilde{\theta}) , \]

where $D_{\mathrm{KL}}(p(\tau) \,\|\, q(\tau)) = \int p(\tau) \log \frac{p(\tau)}{q(\tau)}\, d\tau$. We see that this is equivalent to minimizing the KL divergence between the reward-weighted trajectory distribution of $\pi_\theta$ and the trajectory distribution $p_{\tilde{\theta}}$ of the new policy $\pi_{\tilde{\theta}}$.

3.2. The Lifelong Learning Problem

In contrast to most previous work on policy gradients, which focuses on single-task learning, this paper focuses on the online MTL setting in which the agent is required to learn a series of SDM tasks $\mathcal{Z}^{(1)}, \ldots, \mathcal{Z}^{(T_{\max})}$ over its lifetime. Each task $t$ is an MDP $\mathcal{Z}^{(t)} = \langle \mathcal{X}^{(t)}, \mathcal{A}^{(t)}, P^{(t)}, R^{(t)}, \gamma^{(t)} \rangle$ with initial state distribution $P_0^{(t)}$. The agent learns the tasks consecutively, acquiring multiple trajectories within each task before moving to the next. The tasks may be interleaved, providing the agent the opportunity to revisit earlier tasks for further experience, but the agent has no control over the task order. We assume that a priori the agent knows neither the total number of tasks $T_{\max}$, nor their distribution, nor the task order.

The agent's goal is to learn a set of optimal policies $\Pi^* = \{\pi^*_{\theta^{(1)}}, \ldots, \pi^*_{\theta^{(T_{\max})}}\}$ with corresponding parameters $\Theta^* = \{\theta^{(1)*}, \ldots, \theta^{(T_{\max})*}\}$. At any time, the agent may be evaluated on any previously seen task, and so must strive to optimize its learned policies for all tasks $\mathcal{Z}^{(1)}, \ldots, \mathcal{Z}^{(T)}$, where $T$ denotes the number of tasks seen so far ($1 \leq T \leq T_{\max}$).

4. Online MTL for Policy Gradient Methods

This section develops the Policy Gradient Efficient Lifelong Learning Algorithm (PG-ELLA).

4.1. Learning Objective

To share knowledge between tasks, we assume that each task's control parameters can be modeled as a linear combination of latent components from a shared knowledge base. A number of supervised MTL algorithms (Kumar & Daumé, 2012; Ruvolo & Eaton, 2013; Maurer et al., 2013) have shown this approach to be successful. Our approach incorporates a shared latent basis into policy gradient learning to enable transfer between SDM tasks. PG-ELLA maintains a library of $k$ latent components $L \in \mathbb{R}^{d \times k}$ that is shared among all tasks, forming a basis for the control policies. We can then represent each task's control parameters as a linear combination of this latent basis, $\theta^{(t)} = L s^{(t)}$, where $s^{(t)} \in \mathbb{R}^k$ is a task-specific vector of coefficients. The task-specific coefficients $s^{(t)}$ are encouraged to be sparse to ensure that each learned basis component captures a maximal reusable chunk of knowledge. We can then express our objective of learning stationary policies while maximizing the amount of transfer between task models as:

\[ e_T(L) = \frac{1}{T} \sum_t \min_{s^{(t)}} \Big[ -\mathcal{J}\big(\theta^{(t)}\big) + \mu \big\| s^{(t)} \big\|_1 \Big] + \lambda \|L\|_F^2 , \tag{2} \]

where $\theta^{(t)} = L s^{(t)}$, the $L_1$ norm of $s^{(t)}$ is used to approximate the true vector sparsity, and $\|\cdot\|_F$ is the Frobenius norm. The form of this objective function is closely related to other supervised MTL methods (Ruvolo & Eaton, 2013; Maurer et al., 2013), with important differences through the incorporation of $\mathcal{J}(\cdot)$, as we will examine shortly.
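As a rough illustration (not the authors' implementation), the cost in Eq. 2 can be evaluated in code for a fixed set of coefficient vectors using the factored parameters theta^(t) = L s^(t); the per-task return estimators are assumed to be Monte Carlo estimators like estimate_return from the earlier sketch. Eq. 2 additionally minimizes over each s^(t), which is not shown here.

```python
import numpy as np

def multitask_objective(L, S, return_estimators, mu=0.1, lam=0.01):
    """Sketch of Eq. (2): average negated return per task, plus sparsity and basis regularizers.

    L                 : (d, k) shared latent basis
    S                 : list of (k,) task-specific coefficient vectors s^(t)
    return_estimators : list of callables; return_estimators[t](theta) estimates J(theta) on task t
    """
    T = len(S)
    cost = 0.0
    for t, s_t in enumerate(S):
        theta_t = L @ s_t                                    # task policy parameters theta^(t) = L s^(t)
        cost += -return_estimators[t](theta_t) + mu * np.abs(s_t).sum()
    return cost / T + lam * np.sum(L ** 2)                   # lambda * ||L||_F^2
```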
Our approach to optimizing Eq. 2 is based upon the Efficient Lifelong Learning Algorithm (ELLA) (Ruvolo & Eaton, 2013), which provides a computationally efficient method for learning $L$ and the $s^{(t)}$'s online over multiple tasks in the supervised MTL setting. The objective solved by ELLA is closely related to Eq. 2, with the exception that the $\mathcal{J}(\cdot)$ term is replaced in ELLA with a measure of each task model's average loss over the training data. Since Eq. 2 is not jointly convex in $L$ and the $s^{(t)}$'s, most supervised MTL methods use an expensive alternating optimization procedure to train the task models simultaneously. Ruvolo & Eaton provide an efficient alternative to this procedure that can train task models consecutively, enabling Eq. 2 to be used effectively for online MTL. In the next section, we adapt this approach to the policy gradient framework, and show that the resulting algorithm provides an efficient method for learning consecutive SDM tasks.

4.2. Multi-Task Policy Gradients

Policy gradient methods maximize the lower bound on $\mathcal{J}(\theta)$ (Eq. 1). In order to use Eq. 2 for MTL with policy gradients, we must first incorporate this lower bound into our objective function. Rewriting the error term of Eq. 2 in terms of the lower bound yields

\[ e_T(L) = \frac{1}{T} \sum_t \min_{s^{(t)}} \Big[ -\mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big) + \mu \big\| s^{(t)} \big\|_1 \Big] + \lambda \|L\|_F^2 , \]

where $\tilde{\theta}^{(t)} = L s^{(t)}$. However, we note that

\[ \mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big) \propto - \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau) \log \frac{p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau)}{p_{\tilde{\theta}^{(t)}}(\tau)}\, d\tau . \]

Therefore, maximizing the lower bound $\mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big)$ is equivalent to the following minimization problem:

\[ \min_{\tilde{\theta}^{(t)}} \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau) \log \frac{p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau)}{p_{\tilde{\theta}^{(t)}}(\tau)}\, d\tau . \]

Substituting this result, with $\tilde{\theta}^{(t)} = L s^{(t)}$, into Eq. 2 leads to the following total cost function for MTL with policy gradients:

\[ e_T(L) = \frac{1}{T} \sum_t \left\{ \min_{s^{(t)}} \left[ \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau) \log \frac{p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau)}{p_{L s^{(t)}}(\tau)}\, d\tau + \mu \big\| s^{(t)} \big\|_1 \right] \right\} + \lambda \|L\|_F^2 . \tag{3} \]

While Eq. 3 enables batch MTL using policy gradients, it is computationally expensive due to two inefficiencies that make it inappropriate for online MTL: (a) the explicit dependence on all available trajectories through $\mathcal{J}(\theta^{(t)}) = \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau)\, d\tau$, and (b) the exhaustive evaluation of a single candidate $L$, which requires optimizing all $s^{(t)}$'s through the outer summation. Together, these aspects give Eq. 3 (and similarly Eq. 2) a computational cost that depends on the total number of trajectories and the total number of tasks, complicating its direct use in the lifelong learning setting. We next describe methods for resolving each of these inefficiencies while minimizing Eq. 3, yielding PG-ELLA as an efficient method for multi-task policy gradient learning. In fact, we show that the complexity of PG-ELLA in learning a single task policy is independent of (a) the number of tasks seen so far and (b) the number of trajectories for all other tasks, allowing our approach to be highly efficient.

4.2.1. Eliminating Dependence on Other Tasks

As mentioned above, one of the inefficiencies in minimizing $e_T(L)$ is its dependence on all available trajectories for all tasks. To remedy this problem, as in ELLA, we approximate $e_T(L)$ by performing a second-order Taylor expansion of $\mathcal{J}_{L,\theta}\big(\tilde{\theta}^{(t)}\big)$ around the optimal solution:

\[ \alpha^{(t)} = \arg\min_{\tilde{\theta}^{(t)}} \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau) \log \frac{p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau)}{p_{\tilde{\theta}^{(t)}}(\tau)}\, d\tau . \]

To compute the second-order Taylor representation, the first and second derivatives of $\mathcal{J}_{L,\theta}$ with respect to $\theta^{(t)}$ are required. Using

\[ \log p_{\theta^{(t)}}(\tau) = \log P_0^{(t)}\big(x_0^{(t)}\big) + \sum_{h=0}^{H^{(t)}-1} \log p^{(t)}\big(x_{h+1}^{(t)} \mid x_h^{(t)}, a_h^{(t)}\big) + \sum_{h=0}^{H^{(t)}-1} \log \pi_{\theta^{(t)}}\big(a_h^{(t)} \mid x_h^{(t)}\big) , \]

the first derivative is given by:

\[ \nabla_{\theta^{(t)}} \mathcal{J}_{L,\theta} = \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau) \sum_{h=1}^{H^{(t)}} \nabla_{\theta^{(t)}} \log \pi_{\theta^{(t)}}\big(a_h^{(t)} \mid x_h^{(t)}\big)\, d\tau = \mathbb{E}\!\left[ \mathcal{R}^{(t)}(\tau) \sum_{h=1}^{H^{(t)}} \nabla_{\theta^{(t)}} \log \pi_{\theta^{(t)}}\big(a_h^{(t)} \mid x_h^{(t)}\big) \right] . \]

Policy gradient algorithms determine $\alpha^{(t)} = \theta^{(t)}$ by following the above gradient. The second derivative of $\mathcal{J}_{L,\theta}$ can be computed similarly to produce:

\[ \nabla^2_{\theta^{(t)}, \theta^{(t)}} \mathcal{J}_{L,\theta} = \int_{\mathbb{T}^{(t)}} p_{\theta^{(t)}}(\tau)\, \mathcal{R}^{(t)}(\tau) \sum_{h=1}^{H^{(t)}} \nabla^2_{\theta^{(t)}, \theta^{(t)}} \log \pi_{\theta^{(t)}}\big(a_h^{(t)} \mid x_h^{(t)}\big)\, d\tau . \]

We let $\Gamma^{(t)}$ denote this Hessian evaluated at $\alpha^{(t)}$:

\[ \Gamma^{(t)} = \mathbb{E}\!\left[ \mathcal{R}^{(t)}(\tau) \sum_{h=1}^{H^{(t)}} \nabla^2_{\theta^{(t)}, \theta^{(t)}} \log \pi_{\theta^{(t)}}\big(a_h^{(t)} \mid x_h^{(t)}\big) \right] \Bigg|_{\theta^{(t)} = \alpha^{(t)}} . \]

As shown by Ruvolo & Eaton (2013), this second-order Taylor expansion can be substituted into the MTL objective function to provide a point estimate around the optimal solution, eliminating the dependence on other tasks. Substituting the second-order Taylor approximation into Eq. 3 yields:

\[ \hat{e}_T(L) = \frac{1}{T} \sum_t \min_{s^{(t)}} \Big[ \big\| \alpha^{(t)} - L s^{(t)} \big\|^2_{\Gamma^{(t)}} + \mu \big\| s^{(t)} \big\|_1 \Big] + \lambda \|L\|_F^2 , \tag{4} \]

where $\|v\|^2_A = v^\top A v$, the constant term has been suppressed since it has no effect on the minimization, and the linear term has been dropped since, by construction, $\alpha^{(t)}$ is a minimizer. Most importantly, the dependence on all available trajectories has been eliminated, remedying the first inefficiency.
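For intuition, each per-task term of Eq. 4 is simply a Gamma^(t)-weighted squared reconstruction error of alpha^(t) plus an L1 penalty; the following minimal sketch (an illustration, not the authors' code) evaluates it.

```python
import numpy as np

def per_task_surrogate(L, s_t, alpha_t, Gamma_t, mu=0.1):
    """Per-task term of Eq. (4): ||alpha^(t) - L s^(t)||^2_{Gamma^(t)} + mu * ||s^(t)||_1."""
    diff = alpha_t - L @ s_t
    return float(diff @ Gamma_t @ diff) + mu * np.abs(s_t).sum()
```

Here alpha_t and Gamma_t would come from a base policy gradient learner for task t, as detailed in Section 5.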

4.2.2. Computing the Latent Space

The second inefficiency in Eq. 3 arises from the procedure used to evaluate the objective function for a single candidate $L$. Namely, to determine how well a given value of $L$ serves as a common basis for all learned tasks, an optimization problem must be solved to recompute each of the $s^{(t)}$'s, which becomes increasingly expensive as $T$ grows large. To remedy this problem, we modify Eq. 3 (or, equivalently, Eq. 4) to eliminate the minimization over all $s^{(t)}$'s. Following the approach used in ELLA, we optimize each task-specific projection $s^{(t)}$ only when training on task $t$, without updating it when training on other tasks. Consequently, any changes to $\theta^{(t)}$ when learning other tasks occur only through updates to the shared basis $L$. As shown by Ruvolo & Eaton (2013), this choice to update $s^{(t)}$ only when training on task $t$ does not significantly affect the quality of the model fit as $T$ grows large. With this simplification, we can rewrite Eq. 4 in terms of two update equations:

\[ s^{(t)} \leftarrow \arg\min_{s} \ \ell\big(L_m, s, \alpha^{(t)}, \Gamma^{(t)}\big) \tag{5} \]
\[ L_{m+1} \leftarrow \arg\min_{L} \ \frac{1}{T} \sum_t \ell\big(L, s^{(t)}, \alpha^{(t)}, \Gamma^{(t)}\big) + \lambda \|L\|_F^2 , \tag{6} \]

where $L_m$ refers to the value of the latent basis at the start of the $m$-th training session, $t$ corresponds to the particular task for which data was just received, and

\[ \ell(L, s, \alpha, \Gamma) = \mu \|s\|_1 + \| \alpha - L s \|^2_{\Gamma} . \]

To compute $L_{m+1}$, we null the gradient of Eq. 6 and solve the resulting equation to yield the updated column-wise vectorization of $L$ as $A^{-1} b$, where:

\[ A = \lambda I_{kd,kd} + \frac{1}{T} \sum_t \big( s^{(t)} s^{(t)\top} \big) \otimes \Gamma^{(t)} , \qquad b = \frac{1}{T} \sum_t \operatorname{vec}\!\big( s^{(t)\top} \otimes \big( \alpha^{(t)\top} \Gamma^{(t)} \big) \big) . \]

For efficiency, we can compute $A$ and $b$ incrementally as new tasks arrive, avoiding the need to sum over all tasks.

4.3. Data Generation & Model Update

Using the incremental form (Eqs. 5-6) of the policy gradient MTL objective function (Eq. 3), we can now construct an online MTL algorithm that can operate in a lifelong learning setting. In typical policy gradient methods, trajectories are generated in batch mode by first initializing the policy and sampling trajectories from the system (Kober & Peters, 2011; Peters & Bagnell, 2010). Given these trajectories, the policy parameters are updated, new trajectories are sampled from the system using the updated policy, and the procedure is then repeated. In this work, we adopt a slightly modified version of policy gradients to operate in the lifelong learning setting. The first time a new task is observed, we use a random policy for sampling; each subsequent time the task is observed, we sample trajectories using the previously learned $\alpha^{(t)}$. Additionally, instead of looping until the policy parameters have converged, we perform only one run over the trajectories.

Upon receiving data for a specific task $t$, PG-ELLA performs two steps to update the model: it first computes the task-specific projection $s^{(t)}$, and then refines the shared latent space $L$. To compute $s^{(t)}$, we first determine $\alpha^{(t)}$ and $\Gamma^{(t)}$ using only data from task $t$. The details of this step depend on the form chosen for the policy, as described in Section 5. We can then solve the $L_1$-regularized regression problem given in Eq. 5 (an instance of the Lasso) to yield $s^{(t)}$. In the second step, we update $L$ by first reinitializing any zero-columns of $L$ and then following Eq. 6. The complete method is given as Algorithm 1.

Algorithm 1: PG-ELLA(k, λ, µ)
  T ← 0,  A ← zeros(kd, kd),  b ← zeros(kd, 1),  L ← zeros(d, k)
  while some task t is available do
    if isNewTask(t) then
      T ← T + 1
      (𝕋^(t), R^(t)) ← getRandomTrajectories()
    else
      (𝕋^(t), R^(t)) ← getTrajectories(α^(t))
      A ← A − (s^(t) s^(t)ᵀ) ⊗ Γ^(t)
      b ← b − vec(s^(t)ᵀ ⊗ (α^(t)ᵀ Γ^(t)))
    end if
    Compute α^(t) and Γ^(t) from (𝕋^(t), R^(t))
    L ← reinitializeAllZeroColumns(L)
    s^(t) ← arg min_s ℓ(L, s, α^(t), Γ^(t))
    A ← A + (s^(t) s^(t)ᵀ) ⊗ Γ^(t)
    b ← b + vec(s^(t)ᵀ ⊗ (α^(t)ᵀ Γ^(t)))
    L ← mat(((1/T) A + λ I_{kd,kd})^{-1} (1/T) b)
  end while
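The following Python sketch mirrors the bookkeeping of Algorithm 1 under stated assumptions: a small ISTA loop stands in for the Lasso step of Eq. 5, and the basis update solves the Eq. 6 linear system with the Kronecker-structured A and b. The class and method names (e.g., PGELLA.update) are hypothetical, and the base-learner step that produces alpha^(t) and Gamma^(t) is assumed to happen outside this class.

```python
import numpy as np

def lasso_ista(L, alpha_t, Gamma_t, mu, n_iter=500):
    """Solve Eq. (5): min_s ||alpha - L s||^2_Gamma + mu ||s||_1, via proximal gradient (ISTA)."""
    H = L.T @ Gamma_t @ L                                        # Hessian of the quadratic part
    g = L.T @ Gamma_t @ alpha_t
    step = 1.0 / (2.0 * np.linalg.eigvalsh(H).max() + 1e-8)      # 1 / Lipschitz constant of the gradient
    s = np.zeros(L.shape[1])
    for _ in range(n_iter):
        z = s - step * 2.0 * (H @ s - g)                         # gradient step on the smooth term
        s = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)  # soft-thresholding (prox of mu*||.||_1)
    return s

class PGELLA:
    """Minimal sketch of Algorithm 1's bookkeeping (incremental A, b and the basis update)."""

    def __init__(self, d, k, lam=1e-3, mu=1e-3, seed=0):
        self.d, self.k, self.lam, self.mu = d, k, lam, mu
        self.rng = np.random.default_rng(seed)
        self.L = 0.01 * self.rng.standard_normal((d, k))
        self.A = np.zeros((k * d, k * d))
        self.b = np.zeros(k * d)
        self.S, self.alphas, self.hessians = {}, {}, {}
        self.T = 0

    def update(self, t, alpha_t, Gamma_t):
        """Incorporate a new (alpha, Gamma) estimate for task t and refresh L (one learning session)."""
        if t not in self.S:
            self.T += 1
        else:  # revisited task: remove its stale contribution before re-adding it
            s_old = self.S[t]
            self.A -= np.kron(np.outer(s_old, s_old), self.hessians[t])
            self.b -= np.kron(s_old, self.hessians[t] @ self.alphas[t])
        # reinitialize any all-zero columns of L before solving for s^(t)
        zero_cols = np.where(np.abs(self.L).sum(axis=0) < 1e-12)[0]
        if zero_cols.size:
            self.L[:, zero_cols] = 0.01 * self.rng.standard_normal((self.d, zero_cols.size))
        s_t = lasso_ista(self.L, alpha_t, Gamma_t, self.mu)
        self.A += np.kron(np.outer(s_t, s_t), Gamma_t)
        self.b += np.kron(s_t, Gamma_t @ alpha_t)
        self.S[t], self.alphas[t], self.hessians[t] = s_t, alpha_t, Gamma_t
        M = self.A / self.T + self.lam * np.eye(self.k * self.d)
        vecL = np.linalg.solve(M, self.b / self.T)               # column-major vec(L)
        self.L = vecL.reshape(self.k, self.d).T
        return self.L @ s_t                                      # updated policy parameters for task t
```

In a full learning loop, alpha_t and Gamma_t would come from running the base policy gradient learner (e.g., the REINFORCE-style estimates in Section 5) on trajectories gathered for task t.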
5. Policy Forms & Base Learners

PG-ELLA supports a variety of policy forms and base learners, enabling it to be used in a number of policy gradient settings. This section describes how two popular policy gradient methods can be used as the base learner in PG-ELLA. In theory, any policy gradient learner that can provide an estimate of the Hessian can be incorporated.

5.1. Episodic REINFORCE

In episodic REINFORCE (Williams, 1992), the stochastic policy for task $t$ is chosen according to $a_h^{(t)} = \theta^{(t)\top} x_h^{(t)} + \varepsilon_h$, with $\varepsilon_h \sim \mathcal{N}(0, \sigma^2)$, and so $\pi\big(a_h^{(t)} \mid x_h^{(t)}\big) \sim \mathcal{N}\big(\theta^{(t)\top} x_h^{(t)}, \sigma^2\big)$. Therefore,

\[ \nabla_{\theta^{(t)}} \mathcal{J}_{L,\theta} = \mathbb{E}\!\left[ \mathcal{R}^{(t)}(\tau) \sum_{h=1}^{H^{(t)}} \frac{\big(a_h^{(t)} - \theta^{(t)\top} x_h^{(t)}\big)\, x_h^{(t)}}{\sigma^2} \right] \]

is used to minimize the KL divergence, equivalently maximizing the total discounted payoff. The second derivative for episodic REINFORCE is given by

\[ \Gamma^{(t)} = \mathbb{E}\!\left[ \sum_{h=1}^{H^{(t)}} \frac{x_h^{(t)}\, x_h^{(t)\top}}{\sigma^2} \right] . \]

5.2. Natural Actor Critic

In episodic Natural Actor Critic (eNAC), the stochastic policy for task $t$ is chosen in a similar fashion to that of REINFORCE: $\pi\big(a_h^{(t)} \mid x_h^{(t)}\big) \sim \mathcal{N}\big(\theta^{(t)\top} x_h^{(t)}, \sigma^2\big)$. The change in the probability distribution is measured by a KL divergence that is approximated using a second-order expansion incorporating the Fisher information matrix. Accordingly, the natural gradient follows

\[ \tilde{\nabla}_{\theta} \mathcal{J} = G^{-1}(\theta)\, \nabla_{\theta} \mathcal{J}(\theta) , \]

where $G(\theta)$ denotes the Fisher information matrix. The Hessian can be computed in the same manner as in the previous section. For details, see Peters & Schaal (2008).
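The Section 5.1 quantities can be estimated directly from sampled trajectories. The following sketch assumes the linear-Gaussian policy above and the trajectory format from the earlier sketches; it is illustrative rather than the authors' implementation.

```python
import numpy as np

def reinforce_estimates(trajectories, theta, sigma):
    """Monte Carlo estimates of the Section 5.1 gradient and Hessian for a linear-Gaussian policy.

    trajectories : list of (X, A, R) tuples with X of shape (H, d), A of shape (H,), R of shape (H,)
    Returns (grad, Gamma): the episodic REINFORCE gradient estimate and the Hessian estimate
    sum_h x_h x_h^T / sigma^2, each averaged over the supplied trajectories.
    """
    d = theta.shape[0]
    grad, Gamma = np.zeros(d), np.zeros((d, d))
    for X, A, R in trajectories:
        avg_return = R.mean()                                # R(tau): average per-time-step reward
        resid = (A - X @ theta) / sigma ** 2                 # (a_h - theta^T x_h) / sigma^2
        grad += avg_return * (resid[:, None] * X).sum(axis=0)
        Gamma += (X[:, :, None] * X[:, None, :]).sum(axis=0) / sigma ** 2
    n = len(trajectories)
    return grad / n, Gamma / n
```

A base learner would ascend this gradient (or use the eNAC natural gradient) to obtain alpha^(t), and pass Gamma^(t) to the PG-ELLA update.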
6. Theoretical Results & Computational Cost

Here, we provide theoretical results establishing that PG-ELLA converges and that the cost (in terms of model performance) of making the simplification from Section 4.2.1 is asymptotically negligible. We proceed by first stating theoretical results from Ruvolo & Eaton (2013), and then show that these results apply directly to PG-ELLA with minimal modifications. First, we define:

\[ \hat{g}_T(L) = \frac{1}{T} \sum_t \ell\big(L, s^{(t)}, \alpha^{(t)}, \Gamma^{(t)}\big) + \lambda \|L\|_F^2 . \]

Recall from Section 4.2.1 that this quantity specifies the cost of basis $L$ if we leave the $s^{(t)}$'s fixed (i.e., we only update them when we receive training data for that particular task). We are now ready to state the two results from Ruvolo & Eaton (2013):

Proposition 1: The latent basis becomes more stable over time at a rate of $\|L_{T+1} - L_T\|_F = O(1/T)$.

Proposition 2: (1) $\hat{g}_T(L_T)$ converges almost surely; (2) $\hat{g}_T(L_T) - e_T(L_T)$ converges almost surely to 0.

Proposition 2 establishes that the algorithm converges to a fixed per-task loss on both the approximate objective function $\hat{g}$ and the objective function that does not contain the simplification from Section 4.2.1. Further, Prop. 2 establishes that these two functions converge to the same value. The consequence of this last point is that PG-ELLA does not incur any penalty (in terms of average per-task loss) for making the simplification from Section 4.2.1.

The two propositions require the following assumptions:
1. The tuples $(\Gamma^{(t)}, \alpha^{(t)})$ are drawn i.i.d. from a distribution with compact support, bounding the entries of $\Gamma^{(t)}$ and $\alpha^{(t)}$.
2. For all $L$, $\Gamma^{(t)}$, and $\alpha^{(t)}$, the smallest eigenvalue of $L_\gamma^\top \Gamma^{(t)} L_\gamma$ is at least $\kappa$, with $\kappa > 0$, where $\gamma$ is the subset of non-zero indices of the vector $s^{(t)} = \arg\min_s \|\alpha^{(t)} - L s\|^2_{\Gamma^{(t)}} + \mu \|s\|_1$. In this case the non-zero elements of the unique minimizing $s^{(t)}$ are given by

\[ s^{(t)}_\gamma = \big( L_\gamma^\top \Gamma^{(t)} L_\gamma \big)^{-1} \big( L_\gamma^\top \Gamma^{(t)} \alpha^{(t)} - \mu\, \epsilon_\gamma \big) , \]

where $\epsilon_\gamma$ is a vector containing the signs of the non-zero entries of $s^{(t)}$.

The second assumption is a mild condition on the uniqueness of the sparse coding solution. The first assumption can be verified by assuming that there is no sequential dependency of one task on the next. Additionally, the fact that $\Gamma^{(t)}$ is contained in a compact region can be verified for the episodic REINFORCE algorithm by inspecting the form of the Hessian and requiring that the time horizon $H^{(t)}$ be finite. Using a similar argument, we can see that the magnitude of the gradient for episodic REINFORCE is also bounded when $H^{(t)}$ is finite. If we then assume that we make a finite number of updates for each task model, the sum of all gradient updates is finite, thus guaranteeing that $\alpha^{(t)}$ is contained in a compact region.

Computational Complexity: Each update begins by running a step of policy gradient learning to update $\alpha^{(t)}$ and $\Gamma^{(t)}$. We assume that the cost of the policy gradient update is $O(\xi(d, n_t))$, where the specific cost depends on the particular policy gradient algorithm employed and $n_t$ is the number of trajectories obtained for task $t$ at the current iteration. To complete the analysis, we use the result from Ruvolo & Eaton (2013) that the cost of updating $L$ and $s^{(t)}$ is $O(k^2 d^3)$. This gives an overall cost of $O(k^2 d^3 + \xi(d, n_t))$ for each update.

7. Evaluation

We applied PG-ELLA to learn control policies for the four dynamical systems shown in Figure 1, including three mechanical systems and an application to quadrotor control. We generated multiple tasks by varying the parameterization of each system, yielding a set of tasks from each domain with varying dynamics. For example, the simple mass spring damper system exhibits significantly higher oscillations as the spring constant increases. Notably, the optimal policies for controlling these systems vary significantly even for only slight variations in the system parameters.

7 Online Multi-ask Learning for Policy Gradient Metods, i F x, x i 3, 3 i 1, 1 i able 1 System parameter ranges used in te experiments F SM k [1, 10] d [001, 02] m [05, 5] x, x i F2 2, 2 i e11 e21 e31 F1 rol r e 2B e1 F F3 B e3b x, x i CP & 3CP mc [05, 15] mp [01, 02] l [02, 08] d [001, 009] 3CP l1 [03, 05] l2 [02, 04] l3 [01, 03] 3CP d1 [01, 02] d2 [001, 002] d3 [01, 02] Ii 10 6, 10 4 l 711 E XPERIMENAL P ROOCOL pit c F4 yaw Figure 1 e four dynamical systems: a simple mass spring damper top-left, b cart-pole top-rigt, c tree-link inverted pendulum bottom-left, and d quadrotor bottom-rigt mal policies for controlling tese systems vary significantly even for only sligt variations in te system parameters 71 Bencmark Dynamical Systems We evaluated PG-ELLA on tree bencmark dynamical systems In eac domain, te distance between te current state and te goal position was used as te reward function Simple Mass Spring Damper: e simple mass SM system is caracterized by tree parameters: te spring constant k in N/m, te damping constant d in Ns/m, and te mass m in kg e system s state is given by te position x and velocity x of te mass, wic vary according to a linear force F e goal is to design a policy for controlling te mass to be in a specific state gref = xref, x ref i In our experiments, te goal state varied from being gref = 0, 0i i to gref = i, 0i, were i {1, 2,, 5} Cart-Pole: e cart-pole CP system as been used extensively as a bencmark for evaluating RL algoritms Bus oniu et al, 2010 CP dynamics are caracterized by te cart s mass mc in kg, te pole s mass mp in kg, te pole s lengt l in meters, and a damping parameter d in Ns/m e state is caracterized by te position x and velocity x of te cart, as well as te angle θ and angular velocity θ of te pole e goal is to design a policy capable of controlling te pole in an uprigt position ree-link Inverted Pendulum: e tree-link CP 3CP is a igly nonlinear and difficult system to control e goal is to balance tree connected rods in an uprigt position by moving te cart e dynamics are parameterized by te mass of te cart mc, rod mass mp,i, lengt li, inertia Ii, and damping parameters di, were i {1, 2, 3} represents te index for eac of te tree rods e system s state is caracterized by an eigt-dimensional vector, consisting of te position x and velocity x of te cart, and te angle {θi }3i=1 and angular velocity {θ i }3i=1 of eac rod We first generated 30 tasks for eac domain by varying te system parameters over te ranges given in able 1 ese parameter ranges were cosen to ensure a variety of tasks, including tose tat were difficult to control wit igly caotic dynamics We ten randomized te task order wit repetition and PG-ELLA acquired a limited amount of experience in eac task consecutively, updating L and te st s after eac session At eac learning session, PGELLA was limited to 50 trajectories for SM & CP or 20 trajectories for 3CP wit 150 time steps eac to perform te update Learning ceased once PG-ELLA ad experienced at least one session wit eac task o configure PG-ELLA, we used enac Peters & Scaal, 2008 as te base policy gradient learner e dimensionality k of te latent basis L was cosen independently for eac domain via cross-validation over 10 tasks e stepsize for eac task domain was determined by a line searc after gatering 10 trajectories of lengt 150 o evaluate te learned basis at any point in time, we initialized policies for eac task using θ t = Lst for t = {1,, } Starting from tese initializations, learning on eac task commenced using enac e number of 
To evaluate the learned basis at any point in time, we initialized policies for each task using theta^(t) = L s^(t) for t = {1, ..., T}. Starting from these initializations, learning on each task commenced using eNAC. The number of trajectories varied among the domains, from a minimum of 20 on the simple mass system to a maximum of 50 on the quadrotors. The length of each of these trajectories was set to 150 time steps across all domains. We measured performance using the average reward computed over 50 episodes of 150 time steps, and compared this to standard eNAC running independently with the same settings.

7.1.2. Results on the Benchmark Systems

Figure 2 compares PG-ELLA to standard policy gradient learning using eNAC, showing the average performance on all tasks versus the number of learning iterations. PG-ELLA clearly outperforms standard eNAC in both initial and final performance on all task domains, demonstrating significantly improved performance from MTL. We also evaluated PG-ELLA's performance on all tasks using the basis L learned after observing various subsets of tasks, from observing only three tasks (10%) to observing all 30 tasks (100%). These experiments assessed the quality of the learned basis L on both known and unknown tasks, showing that performance increases as PG-ELLA

Figure 2. The performance of PG-ELLA versus standard policy gradients (eNAC) on the benchmark dynamical systems: (a) simple mass spring damper, (b) cart-pole, (c) three-link inverted pendulum. Curves show average reward versus iterations for PG-ELLA with 10%, 30%, 50%, and 100% of tasks observed, and for standard policy gradients.

learns more tasks. When a particular task had not been observed, the most recent L with a zero initialization of s^(t) was used. To assess the difference in the total number of trajectories between PG-ELLA and eNAC, we also tried giving eNAC an additional 50 trajectories of length 150 time steps at each iteration; however, its overall performance did not change.

7.2. Quadrotor Control

We also evaluated PG-ELLA on an application to quadrotor control, providing a more challenging domain. The quadrotor system is illustrated in Figure 1, with dynamics influenced by inertial constants around e_{1,B}, e_{2,B}, and e_{3,B}, thrust factors influencing how the rotors' speed affects the overall variation of the system's state, and the length of the rods supporting the rotors. Although the overall state of the system can be described by a nine-dimensional vector, we focus on stability and so consider only six of these state variables. The quadrotor system has a high-dimensional action space, where the goal is to control the four rotational velocities of the rotors to stabilize the system. To ensure realistic dynamics, we used the simulated model described by Bouabdallah (2007), which has been verified on and used in the control of a physical quadrotor.

To produce multiple tasks, we generated 15 quadrotor systems by varying each of: the inertia around the x-axis I_xx in [4.5e-3, 6.5e-3], the inertia around the y-axis I_yy in [4.2e-3, 5.2e-3], the inertia around the z-axis I_zz in [1.5e-2, 2.1e-2], and the length of the arms l in [0.27, 0.3]. In each case, these parameter values have been used by Bouabdallah (2007) to describe physical quadrotors. We used a linear quadratic regulator, as described by Bouabdallah, to initialize the policies in both the learning (i.e., determining L and s^(t)) and testing (i.e., comparing to standard policy gradients) phases. We followed an experimental procedure similar to that used for the benchmark systems, using 50 trajectories of 150 time steps to perform an eNAC policy gradient update at each learning session.

Figure 3. Performance on quadrotor control (average reward versus iterations for PG-ELLA with 10%, 30%, 50%, and 100% of tasks observed, and for standard policy gradients).

Figure 3 compares PG-ELLA to standard policy gradients (eNAC) on quadrotor control. As on the benchmark systems, we see that PG-ELLA clearly outperforms standard eNAC in both initial and final performance, and this performance increases as PG-ELLA learns more tasks. The final performance of the policy learned by PG-ELLA after observing all tasks is significantly better than that of the policy learned using standard policy gradients, showing the benefits of knowledge transfer between tasks. Most importantly for practical applications, by using the basis L learned over previous tasks, PG-ELLA can achieve high performance on a new task much more quickly (with fewer trajectories) than standard
policy gradient methods.

8. Conclusion & Future Work

PG-ELLA provides an efficient mechanism for online MTL of SDM tasks while providing improved performance over standard policy gradient methods. By supporting knowledge transfer between tasks via a shared latent basis, PG-ELLA is also able to rapidly learn policies for new tasks, giving an agent the ability to rapidly adapt to new situations. In future work, we intend to explore the potential for cross-domain transfer with PG-ELLA.

Acknowledgements

This work was partially supported by grants from ONR, AFOSR, and NSF. We thank the reviewers for their helpful suggestions.

References

Bócsi, B., Csató, L., and Peters, J. Alignment-based transfer learning for robot models. In Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), 2013.

Bou-Ammar, H., Taylor, M.E., Tuyls, K., Driessens, K., and Weiss, G. Reinforcement learning transfer via sparse coding. In Proceedings of the 11th Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2012.

Bouabdallah, S. Design and control of quadrotors with application to autonomous flying. PhD thesis, École polytechnique fédérale de Lausanne, 2007.

Buşoniu, L., Babuška, R., De Schutter, B., and Ernst, D. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, Boca Raton, Florida, 2010.

Daniel, C., Neumann, G., Kroemer, O., and Peters, J. Learning sequential motor tasks. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA), 2013.

Deisenroth, M.P., Englert, P., Peters, J., and Fox, D. Multi-task policy search for robotics. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014.

Fernández, F. and Veloso, M. Learning domain structure through probabilistic policy reuse in reinforcement learning. Progress in AI, 2(1):13-27, 2013.

Kober, J. and Peters, J. Policy search for motor primitives in robotics. Machine Learning, 84(1-2), July 2011.

Kumar, A. and Daumé III, H. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Kupcsik, A.G., Deisenroth, M.P., Peters, J., and Neumann, G. Data-efficient generalization of robot skills with contextual policy search. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2013.

Lazaric, A. and Ghavamzadeh, M. Bayesian multi-task reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Li, H., Liao, X., and Carin, L. Multi-task reinforcement learning in partially observable stochastic environments. Journal of Machine Learning Research, 10, 2009.

Liu, Y. and Stone, P. Value-function-based transfer for reinforcement learning using structure mapping. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.

Maurer, A., Pontil, M., and Romera-Paredes, B. Sparse coding for multitask and transfer learning. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Peters, J. and Bagnell, J.A. Policy gradient methods. Encyclopedia of Machine Learning, 2010.

Peters, J. and Schaal, S. Applying the episodic natural actor-critic architecture to motor primitive learning. In Proceedings of the 2007 European Symposium on Artificial Neural Networks (ESANN), 2007.

Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7-9), 2008.

Rai, P. and Daumé III, H. Infinite predictor subspace models for multitask learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Sutton, R.S., McAllester, D.A., Singh, S.P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems (NIPS), 1999.

Taylor, M.E. and Stone, P. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research, 10, 2009.

Taylor, M.E., Whiteson, S., and Stone, P. Transfer via inter-task mappings in policy search reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2007.

Taylor, M.E., Kuhlmann, G., and Stone, P. Autonomous transfer for reinforcement learning. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2008.

Thrun, S. and O'Sullivan, J. Discovering structure in multiple learning tasks: the TC algorithm. In Proceedings of the 13th International Conference on Machine Learning (ICML), 1996.

Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 1992.

Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

Zhang, J., Ghahramani, Z., and Yang, Y. Flexible latent variable models for multi-task learning. Machine Learning, 73(3), 2008.


More information

The modelling of business rules for dashboard reporting using mutual information

The modelling of business rules for dashboard reporting using mutual information 8 t World IMACS / MODSIM Congress, Cairns, Australia 3-7 July 2009 ttp://mssanz.org.au/modsim09 Te modelling of business rules for dasboard reporting using mutual information Gregory Calbert Command, Control,

More information

Pretrial Settlement with Imperfect Private Monitoring

Pretrial Settlement with Imperfect Private Monitoring Pretrial Settlement wit Imperfect Private Monitoring Mostafa Beskar Indiana University Jee-Hyeong Park y Seoul National University April, 2016 Extremely Preliminary; Please Do Not Circulate. Abstract We

More information

Factoring Synchronous Grammars By Sorting

Factoring Synchronous Grammars By Sorting Factoring Syncronous Grammars By Sorting Daniel Gildea Computer Science Dept. Uniersity of Rocester Rocester, NY Giorgio Satta Dept. of Information Eng g Uniersity of Padua I- Padua, Italy Hao Zang Computer

More information

2.12 Student Transportation. Introduction

2.12 Student Transportation. Introduction Introduction Figure 1 At 31 Marc 2003, tere were approximately 84,000 students enrolled in scools in te Province of Newfoundland and Labrador, of wic an estimated 57,000 were transported by scool buses.

More information

SAT Math Must-Know Facts & Formulas

SAT Math Must-Know Facts & Formulas SAT Mat Must-Know Facts & Formuas Numbers, Sequences, Factors Integers:..., -3, -2, -1, 0, 1, 2, 3,... Rationas: fractions, tat is, anyting expressabe as a ratio of integers Reas: integers pus rationas

More information

Welfare, financial innovation and self insurance in dynamic incomplete markets models

Welfare, financial innovation and self insurance in dynamic incomplete markets models Welfare, financial innovation and self insurance in dynamic incomplete markets models Paul Willen Department of Economics Princeton University First version: April 998 Tis version: July 999 Abstract We

More information

Free Shipping and Repeat Buying on the Internet: Theory and Evidence

Free Shipping and Repeat Buying on the Internet: Theory and Evidence Free Sipping and Repeat Buying on te Internet: eory and Evidence Yingui Yang, Skander Essegaier and David R. Bell 1 June 13, 2005 1 Graduate Scool of Management, University of California at Davis (yiyang@ucdavis.edu)

More information

Research on Risk Assessment of PFI Projects Based on Grid-fuzzy Borda Number

Research on Risk Assessment of PFI Projects Based on Grid-fuzzy Borda Number Researc on Risk Assessent of PFI Projects Based on Grid-fuzzy Borda Nuber LI Hailing 1, SHI Bensan 2 1. Scool of Arcitecture and Civil Engineering, Xiua University, Cina, 610039 2. Scool of Econoics and

More information

Haptic Manipulation of Virtual Materials for Medical Application

Haptic Manipulation of Virtual Materials for Medical Application Haptic Manipulation of Virtual Materials for Medical Application HIDETOSHI WAKAMATSU, SATORU HONMA Graduate Scool of Healt Care Sciences Tokyo Medical and Dental University, JAPAN wakamatsu.bse@tmd.ac.jp

More information

On Distributed Key Distribution Centers and Unconditionally Secure Proactive Verifiable Secret Sharing Schemes Based on General Access Structure

On Distributed Key Distribution Centers and Unconditionally Secure Proactive Verifiable Secret Sharing Schemes Based on General Access Structure On Distributed Key Distribution Centers and Unconditionally Secure Proactive Verifiable Secret Saring Scemes Based on General Access Structure (Corrected Version) Ventzislav Nikov 1, Svetla Nikova 2, Bart

More information

A Behavior Based Kernel for Policy Search via Bayesian Optimization

A Behavior Based Kernel for Policy Search via Bayesian Optimization via Bayesian Optimization Aaron Wilson WILSONAA@EECS.OREGONSTATE.EDU Alan Fern AFERN@EECS.OREGONSTATE.EDU Prasad Tadepalli TADEPALL@EECS.OREGONSTATE.EDU Oregon State University School of EECS, 1148 Kelley

More information

Working Capital 2013 UK plc s unproductive 69 billion

Working Capital 2013 UK plc s unproductive 69 billion 2013 Executive summary 2. Te level of excess working capital increased 3. UK sectors acieve a mixed performance 4. Size matters in te supply cain 6. Not all companies are overflowing wit cas 8. Excess

More information

Multivariate time series analysis: Some essential notions

Multivariate time series analysis: Some essential notions Capter 2 Multivariate time series analysis: Some essential notions An overview of a modeling and learning framework for multivariate time series was presented in Capter 1. In tis capter, some notions on

More information

Part II: Finite Difference/Volume Discretisation for CFD

Part II: Finite Difference/Volume Discretisation for CFD Part II: Finite Difference/Volume Discretisation for CFD Finite Volume Metod of te Advection-Diffusion Equation A Finite Difference/Volume Metod for te Incompressible Navier-Stokes Equations Marker-and-Cell

More information

Evaluating probabilities under high-dimensional latent variable models

Evaluating probabilities under high-dimensional latent variable models Evaluating probabilities under ig-dimensional latent variable models Iain Murray and Ruslan alakutdinov Department of Computer cience University of oronto oronto, ON. M5 3G4. Canada. {murray,rsalaku}@cs.toronto.edu

More information

TD(0) Leads to Better Policies than Approximate Value Iteration

TD(0) Leads to Better Policies than Approximate Value Iteration TD(0) Leads to Better Policies than Approximate Value Iteration Benjamin Van Roy Management Science and Engineering and Electrical Engineering Stanford University Stanford, CA 94305 bvr@stanford.edu Abstract

More information

Pioneer Fund Story. Searching for Value Today and Tomorrow. Pioneer Funds Equities

Pioneer Fund Story. Searching for Value Today and Tomorrow. Pioneer Funds Equities Pioneer Fund Story Searcing for Value Today and Tomorrow Pioneer Funds Equities Pioneer Fund A Cornerstone of Financial Foundations Since 1928 Te fund s relatively cautious stance as kept it competitive

More information

Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs

Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs Large-scale Virtual Acoustics Simulation at Audio Rates Using Tree Dimensional Finite Difference Time Domain and Multiple GPUs Craig J. Webb 1,2 and Alan Gray 2 1 Acoustics Group, University of Edinburg

More information

A hybrid model of dynamic electricity price forecasting with emphasis on price volatility

A hybrid model of dynamic electricity price forecasting with emphasis on price volatility all times On a non-liquid market, te accuracy of a price A ybrid model of dynamic electricity price forecasting wit empasis on price volatility Marin Cerjan Abstract-- Accurate forecasting tools are essential

More information

Government Debt and Optimal Monetary and Fiscal Policy

Government Debt and Optimal Monetary and Fiscal Policy Government Debt and Optimal Monetary and Fiscal Policy Klaus Adam Manneim University and CEPR - preliminary version - June 7, 21 Abstract How do di erent levels of government debt a ect te optimal conduct

More information

Abstract. Introduction

Abstract. Introduction Fast solution of te Sallow Water Equations using GPU tecnology A Crossley, R Lamb, S Waller JBA Consulting, Sout Barn, Brougton Hall, Skipton, Nort Yorksire, BD23 3AE. amanda.crossley@baconsulting.co.uk

More information

Determine the perimeter of a triangle using algebra Find the area of a triangle using the formula

Determine the perimeter of a triangle using algebra Find the area of a triangle using the formula Student Name: Date: Contact Person Name: Pone Number: Lesson 0 Perimeter, Area, and Similarity of Triangles Objectives Determine te perimeter of a triangle using algebra Find te area of a triangle using

More information

Motivation. Motivation. Can a software agent learn to play Backgammon by itself? Machine Learning. Reinforcement Learning

Motivation. Motivation. Can a software agent learn to play Backgammon by itself? Machine Learning. Reinforcement Learning Motivation Machine Learning Can a software agent learn to play Backgammon by itself? Reinforcement Learning Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut

More information

FINANCIAL SECTOR INEFFICIENCIES AND THE DEBT LAFFER CURVE

FINANCIAL SECTOR INEFFICIENCIES AND THE DEBT LAFFER CURVE INTERNATIONAL JOURNAL OF FINANCE AND ECONOMICS Int. J. Fin. Econ. 10: 1 13 (2005) Publised online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/ijfe.251 FINANCIAL SECTOR INEFFICIENCIES

More information

A New Cement to Glue Nonconforming Grids with Robin Interface Conditions: The Finite Element Case

A New Cement to Glue Nonconforming Grids with Robin Interface Conditions: The Finite Element Case A New Cement to Glue Nonconforming Grids wit Robin Interface Conditions: Te Finite Element Case Martin J. Gander, Caroline Japet 2, Yvon Maday 3, and Frédéric Nataf 4 McGill University, Dept. of Matematics

More information

TRADING AWAY WIDE BRANDS FOR CHEAP BRANDS. Swati Dhingra London School of Economics and CEP. Online Appendix

TRADING AWAY WIDE BRANDS FOR CHEAP BRANDS. Swati Dhingra London School of Economics and CEP. Online Appendix TRADING AWAY WIDE BRANDS FOR CHEAP BRANDS Swati Dingra London Scool of Economics and CEP Online Appendix APPENDIX A. THEORETICAL & EMPIRICAL RESULTS A.1. CES and Logit Preferences: Invariance of Innovation

More information

Orchestrating Bulk Data Transfers across Geo-Distributed Datacenters

Orchestrating Bulk Data Transfers across Geo-Distributed Datacenters Tis article as been accepted for publication in a future issue of tis journal, but as not been fully edited Content may cange prior to final publication Citation information: DOI 101109/TCC20152389842,

More information

RISK ASSESSMENT MATRIX

RISK ASSESSMENT MATRIX U.S.C.G. AUXILIARY STANDARD AV-04-4 Draft Standard Doc. AV- 04-4 18 August 2004 RISK ASSESSMENT MATRIX STANDARD FOR AUXILIARY AVIATION UNITED STATES COAST GUARD AUXILIARY NATIONAL OPERATIONS DEPARTMENT

More information

EUROSYSTEM. Working Paper

EUROSYSTEM. Working Paper BANK OF GREECE EUROSYSTEM Working Paper Macroeconomic and bank-specific determinants of non-performing loans in Greece: a comparative study of mortgage, business and consumer loan portfolios Dimitrios

More information

Notes: Most of the material in this chapter is taken from Young and Freedman, Chap. 12.

Notes: Most of the material in this chapter is taken from Young and Freedman, Chap. 12. Capter 6. Fluid Mecanics Notes: Most of te material in tis capter is taken from Young and Freedman, Cap. 12. 6.1 Fluid Statics Fluids, i.e., substances tat can flow, are te subjects of tis capter. But

More information

a joint initiative of Cost of Production Calculator

a joint initiative of Cost of Production Calculator a joint initiative of Cost of Production Calculator 1 KEY BENEFITS Learn to use te MAKING MORE FROM SHEEP cost of production calculator to: Measure te performance of your seep enterprise year on year Compare

More information

An Orientation to the Public Health System for Participants and Spectators

An Orientation to the Public Health System for Participants and Spectators An Orientation to te Public Healt System for Participants and Spectators Presented by TEAM ORANGE CRUSH Pallisa Curtis, Illinois Department of Public Healt Lynn Galloway, Vermillion County Healt Department

More information

Operation go-live! Mastering the people side of operational readiness

Operation go-live! Mastering the people side of operational readiness ! I 2 London 2012 te ultimate Up to 30% of te value of a capital programme can be destroyed due to operational readiness failures. 1 In te complex interplay between tecnology, infrastructure and process,

More information

Eligibility Traces. Suggested reading: Contents: Chapter 7 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.

Eligibility Traces. Suggested reading: Contents: Chapter 7 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998. Eligibility Traces 0 Eligibility Traces Suggested reading: Chapter 7 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998. Eligibility Traces Eligibility Traces 1 Contents:

More information

Human Capital, Asset Allocation, and Life Insurance

Human Capital, Asset Allocation, and Life Insurance Human Capital, Asset Allocation, and Life Insurance By: P. Cen, R. Ibbotson, M. Milevsky and K. Zu Version: February 25, 2005 Note: A Revised version of tis paper is fortcoming in te Financial Analysts

More information

Shell and Tube Heat Exchanger

Shell and Tube Heat Exchanger Sell and Tube Heat Excanger MECH595 Introduction to Heat Transfer Professor M. Zenouzi Prepared by: Andrew Demedeiros, Ryan Ferguson, Bradford Powers November 19, 2009 1 Abstract 2 Contents Discussion

More information