A Behavior Based Kernel for Policy Search via Bayesian Optimization

Aaron Wilson, Alan Fern, Prasad Tadepalli
Oregon State University, School of EECS, 1148 Kelley Engineering Center, Corvallis, OR

Appearing in ICML 2011 Workshop: Planning and Acting with Uncertain Models, Bellevue, WA, USA. Copyright 2011 by the author(s)/owner(s).

Abstract

We expand on past successes applying Bayesian Optimization (BO) to the Reinforcement Learning (RL) problem. BO is a general method of searching for the maximum of an unknown objective function. The BO method explicitly aims to reduce the number of samples needed to identify the optimal solution by exploiting a probabilistic model of the objective function. Much work in BO has focused on Gaussian Process (GP) models of the objective. The performance of these models relies on the design of the kernel function relating points in the solution space. Unfortunately, previous approaches adapting ideas from BO to the RL setting have focused on simple kernels that are not well justified in the RL context. We show that a new kernel can be motivated by examining an upper bound on the absolute difference in expected return between policies. The resulting kernel explicitly compares the behaviors of policies in terms of their trajectory probability densities. We incorporate the behavior based kernel into a BO algorithm for policy search. Results reported on four standard benchmark domains show that our algorithm significantly outperforms alternative state-of-the-art algorithms.

1. Introduction

In the policy search setting, RL agents seek an optimal policy within a fixed set. In such a setting an agent executes a sequence of policies, searching for the true optimum. Naturally, future policy selection decisions should benefit from the information available in all samples. A question arises regarding how the expected return of untried policies can be estimated using the batch of samples, and how to best use the estimated returns to perform policy search.
In this work we propose explicitly constructing a probabilistic model of the expected return informed by observations of past policy behaviors. We exploit this probabilistic model of the return by selecting new policies predicted to best improve on the performance of policies in the sample set. Our approach is based on adapting black box Bayesian Optimization (BO) to the RL problem. BO is a method of planning a sequence of queries of an unknown objective function in order to find its maximum. It is an ideal method for tackling the basic problem of policy search, as it directly confronts the fundamental issue of trading off exploration of the objective function (global searches) with exploitation (local searches).

Fundamental to the application of Bayesian Optimization techniques is the definition of a Bayesian prior distribution for the objective function. BO searches this surrogate representation of the objective function for maximal points instead of directly querying the true objective. The hope is that, by using a large number of surrogate function evaluations (trading computational resources for higher quality samples), the true maximum can be identified with few queries to the true objective. As in most Bayesian methods, the success of the BO technique rests on the quality of the modeling effort. How should the objective, the expected return in the RL case, be effectively modeled? In this work, similar to past efforts applying BO to RL, we focus on GP models of the expected return. The generalization performance of GP models, and hence the performance of the BO technique, is strongly impacted by the definition of the kernel function, which encodes a notion of relatedness between points in the function space. When applying BO to RL this means encoding a notion of similarity between policies.
Past work has used simple kernels to relate policy parameters, for instance squared exponential kernels (Lizotte et al., 2007; Wilson et al., 2010). Unfortunately, the selected kernels fail to account for the special properties of sequential decision processes typical of RL problems. A more appropriate notion of relatedness is needed for the RL context. We propose that policies are better related by their behavior than by their parameters. Below we motivate our behavior-based kernel function. We then discuss how to incorporate the kernel into a BO approach when only a sparse sample of policy trajectories is available. Empirically we demonstrate that the behavior-based kernel significantly improves BO and outperforms a selection of standard algorithms on four benchmark domains.

2. Problem Setting

We study the Reinforcement Learning problem in the context of Markov Decision Processes (MDPs). An MDP is described by a tuple $(S, A, P, P_0, R, \pi)$. We consider processes with continuous state and action values, where each state and action is a vector, $s \in \mathbb{R}^n$ and $a \in \mathbb{R}$. The transition function $P$ is a probability distribution $P(s_t \mid s_{t-1}, a_{t-1})$ defining the response of the process to the agent's action selections. The distribution $P_0$ gives the probability of beginning in a particular state. The reward function $R(s, a)$ returns a numeric value representing the immediate reward for the state-action pair (we do not consider stochastic reward functions). Finally, the policy $\pi$ is a stochastic mapping from states to actions, $P_\pi(a \mid \phi(s), \theta)$. It is a function of a vector of parameters $\theta \in \mathbb{R}^k$ and of features of the state $\phi(s)$.

We are interested in episodic average reward RL. Define the trajectory density,

$$P(\xi \mid \theta) = P_0(s_0) \prod_{t=1}^{T} P(s_t \mid s_{t-1}, a_{t-1}) \, P_\pi(a_{t-1} \mid \phi(s_{t-1}), \theta),$$

and the value of a trajectory,

$$R(\xi) = \sum_{t=0}^{T} R(s_t, a_t).$$

The variable $T$ is assumed to have a maximum value, ensuring that all trajectories have finite length. Generalization of our approach to the infinite horizon case is possible, but is not a focus of this work. We define the expected return in terms of the integral over paths,

$$\eta(\theta) = \int R(\xi) \, P(\xi \mid \theta) \, d\xi.$$

The basic policy search problem is to identify the policy parameters that maximize this expectation, $\arg\max_\theta \eta(\theta)$.

3. Policy Search via Bayesian Optimization

Bayesian optimization addresses the general problem of maximizing a real valued function,

$$\theta^* = \arg\max_\theta \eta(\theta).$$

BO is a global method for tackling expensive objective functions by explicitly reducing the number of evaluations needed before the maximum is found. In BO the objective function is treated as a random variable and modeled by a probability distribution $P(\eta)$, and the uncertainty encoded by this distribution is used to select which points will be used to query the objective. The principal idea is to use a large number of surrogate evaluations to reduce the number of expensive evaluations of the true objective.

The BO method is a form of active learning. Given the prior distribution over the objective function, BO proceeds iteratively. A point is selected according to some criterion (the selection criterion is a function of the posterior), the point is evaluated (a policy in our case), the posterior distribution is updated using the data, a new point is selected, and so on. Because this is an active method of learning, the criterion for selecting new query points plays a critical role in the quality of the posterior estimate of the surface and in the speed of identifying the maximum. Any selection criterion must address the trade-off between exploration and exploitation. Because the Bayes optimal selection criterion is computationally intractable, a heuristic method of selection must be used. A common heuristic called Maximum Expected Improvement (MEI) is the method of selection used in this work.

Suppose we have a collection of $n$ points in the $\theta$ space and their associated objective function values. Define $\eta_{max}$ to be the highest observed return in the data set. Consider the following function,

$$I(\theta) = \max\{0, \eta(\theta) - \eta_{max}\},$$

which returns the amount by which the return at point $\theta$ exceeds the observed maximum.
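As a concrete illustration of the improvement function, suppose the belief about $\eta(\theta)$ at a candidate point is Gaussian with mean `mu` and standard deviation `sigma`, as it is under a GP model. The expectation of $I(\theta)$ then has a well-known closed form. The sketch below is a minimal Python version using only the standard library; the function name and argument names are illustrative assumptions and do not come from the paper.

```python
import math


def expected_improvement(mu, sigma, eta_max):
    """Closed-form E[max(0, eta - eta_max)] for eta ~ N(mu, sigma^2).

    mu, sigma: assumed Gaussian belief about the return at a candidate point.
    eta_max:   highest return observed so far.
    """
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(0.0, mu - eta_max)
    z = (mu - eta_max) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - eta_max) * cdf + sigma * pdf
```

For a fixed mean below `eta_max`, the value grows with `sigma`, which is the mechanism that draws the search toward uncertain regions.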
MEI searches for the maximum of the expectation of this improvement function with respect to the posterior uncertainty $P(\eta(\theta) \mid D)$ given observed data $D$,

$$\theta_{n+1} = \arg\max_\theta E_{P(\eta(\theta) \mid D)}[I(\theta)].$$

Conveniently, closed form solutions exist for the Expected Improvement (EI). By incorporating the posterior uncertainty into the selection process, MEI is guaranteed to explore regions of sufficiently high uncertainty. Clearly, when the conditional posterior distribution has sufficient probability mass above the current maximum the EI will be positive, pushing the algorithm to execute experiments in new regions. Due to its empirical success the MEI criterion, originally proposed by Mockus (1994), has become the standard choice in most work on BO. Recent work has also established the convergence properties of iteratively selecting points using MEI with a GP prior (Vazquez & Bect, 2010), lending further weight to its continued use.

Crucial to the performance of the BO method is the definition of the prior distribution over the objective function. Our objective in the policy search problem is maximization of the expected return, and it is this quantity that we model using the GP. GPs are defined by a mean and covariance function,

$$\eta(\theta) \sim GP(m(\theta), K(\theta, \theta')).$$

The mean $m(\theta)$ encodes prior assumptions about the underlying function space (it is frequently assumed to be zero). The covariance function $K(\theta, \theta')$ encodes relationships between points in the function space. Substantial engineering effort has been devoted to developing meaningful kernel functions, $K(\theta_i, \theta_j)$, for a variety of domains due to the kernel's impact on generalization performance.

Algorithm 1 Bayesian Optimization Algorithm for RL
1: Let $D_{1:n} = \{\hat{\eta}(\theta_i), \xi_i\}_{i=1}^{n}$.
2: Compute the matrix of covariances $K$.
3: Select the next point in the policy space to evaluate: $\theta_{n+1} = \arg\max_\theta E_{P(\eta(\theta) \mid D_{1:n})}[I(\theta)]$.
4: Execute the policy $\theta_{n+1}$ for $E$ episodes.
5: Compute the Monte Carlo estimate of the expected return, $\hat{\eta}(\theta_{n+1}) = \frac{1}{E} \sum_{\xi \in \xi_{n+1}} R(\xi)$.
6: Update $D_{1:n+1} = D_{1:n} \cup \{(\hat{\eta}(\theta_{n+1}), \xi_{n+1})\}$.
7: Return to step 2.

Consider the basic Bayesian Optimization Algorithm (Algorithm 1). Line 1 assumes a batch of data of the form $\{\hat{\eta}(\theta_i), \xi_i\}$, and we denote the full set of observations from all past policies by $D_{1:n} = \{\hat{\eta}(\theta_i), \xi_i\}_{i=1}^{n}$. We write $\hat{\eta}$ to indicate a Monte Carlo estimate of the expected return for policy $\theta$, and $\xi_i$ indicates the set of trajectories used in the Monte Carlo estimate. Given this data, the surface of the expected return is modeled using the GP prior. For the moment we leave aside the computation of the covariance function in line 2. To maximize the expected improvement in line 3, the GP posterior distribution $P(\eta(\theta) \mid D_{1:n})$ must be computed. In the GP model this posterior has a simple form. Given the data $D_{1:n}$, let $y$ be the vector of outputs such that $y_i = \hat{\eta}(\theta_i)$, and let $K(\theta, \theta)$ be the covariance matrix with elements $K(\theta_i, \theta_j)$.
Consequently the conditional posterior distribution is Gaussian with mean and variance,

$$\mu(\eta(\theta_{n+1}) \mid D_{1:n}) = k(\theta_{n+1}, \theta) \, K(\theta, \theta)^{-1} y,$$

$$\sigma^2(\eta(\theta_{n+1}) \mid D_{1:n}) = k(\theta_{n+1}, \theta_{n+1}) - k(\theta_{n+1}, \theta) \, K(\theta, \theta)^{-1} k(\theta, \theta_{n+1}),$$

where $k(\theta_{n+1}, \theta)$ is the vector of similarities between the new point and all previously observed points, and $k(\theta, \theta_{n+1})$ is its transpose. It is at line 3 that the selection of the kernel function has its impact. The kernel controls how the information in the sample is generalized to new points, and therefore the quality of the points returned by the optimization. Our work is an effort to improve the generalization performance of the GP by defining a meaningful kernel for the RL context. We discuss the motivation for our kernel and its estimation below.

3.1. Behavior-based Kernel

It turns out that a simple bound relates the difference in returns of two policies to the KL-divergence of their trajectory densities. Consider the difference in expected returns of two policies indexed by $\theta_i$ and $\theta_j$. The absolute value of this difference, $|\eta(\theta_i) - \eta(\theta_j)|$, has an upper bound expressed in terms of the Kullback-Leibler (KL) divergence,

$$KL(P(\xi \mid \theta_i) \,\|\, P(\xi \mid \theta_j)) = \int P(\xi \mid \theta_i) \log\left( \frac{P(\xi \mid \theta_i)}{P(\xi \mid \theta_j)} \right) d\xi,$$

of the trajectory probability densities.

Theorem 1. For any $\theta_i$ and $\theta_j$,

$$|\eta(\theta_i) - \eta(\theta_j)| \le R_{max} \sqrt{2 \left[ KL(P(\xi \mid \theta_i) \,\|\, P(\xi \mid \theta_j)) + KL(P(\xi \mid \theta_j) \,\|\, P(\xi \mid \theta_i)) \right]}.$$

Proof.

$$
\begin{aligned}
|\eta(\theta_i) - \eta(\theta_j)| &= \left| \int R(\xi) P(\xi \mid \theta_i) \, d\xi - \int R(\xi) P(\xi \mid \theta_j) \, d\xi \right| \\
&= \left| \int R(\xi) \left( P(\xi \mid \theta_i) - P(\xi \mid \theta_j) \right) d\xi \right| \\
&\le \int \left| R(\xi) \left( P(\xi \mid \theta_i) - P(\xi \mid \theta_j) \right) \right| d\xi \\
&\le R_{max} \int \left| P(\xi \mid \theta_i) - P(\xi \mid \theta_j) \right| d\xi \\
&\le R_{max} \sqrt{2 \, KL(P(\xi \mid \theta_i) \,\|\, P(\xi \mid \theta_j))} \\
&\le R_{max} \sqrt{2 \left[ KL(P(\xi \mid \theta_i) \,\|\, P(\xi \mid \theta_j)) + KL(P(\xi \mid \theta_j) \,\|\, P(\xi \mid \theta_i)) \right]} \\
&= R_{max} \sqrt{2 \, D(\theta_i, \theta_j)},
\end{aligned}
$$

where $D(\theta_i, \theta_j)$ denotes the symmetric divergence in brackets. The $R_{max}$ term, introduced in line 4, represents the maximal score for any finite length trajectory. The first introduction of the KL-divergence is justified by Pinsker's inequality. Pinsker's inequality bounds from above the variational distance between two distributions, defined on arbitrary sets, by the divergence term shown above.
The inequality states that

$$\tfrac{1}{2} V(P, Q)^2 \le KL(P \,\|\, Q),$$

where $V$ is the variational distance, $V(P, Q) = \int |P(x) - Q(x)| \, dx$. The inequality was originally proposed in (Pinsker, 1964), with recent generalizations to other variational distances in (Reid & Williamson, 2009). The second to last line introduces the symmetric KL-divergence, which bounds the standard divergence from above (since $KL(P \,\|\, Q) \ge 0$).

Importantly, the bound is a symmetric positive measure of distance between policies. It bounds, from above, the absolute difference in expected value, and reaches zero only when the divergence is zero. Additionally, though the variational bound is strictly tighter than the divergence based bound reported here, computing the variational distance inherently requires knowledge of the domain transition models. In contrast, the logarithmic term of the KL-divergence is a ratio of path probabilities and can be computed with no knowledge of the domain model. This characteristic is important when learned models are not available.

Our goal is to incorporate the final measure of policy relatedness into the surrogate representation of the expected return. Unfortunately, the divergence function does not meet the standard requirements for a kernel (Moreno et al., 2004). To transform the bound into a valid kernel we first define a function,

$$D(\theta_i, \theta_j) = KL(P(\xi \mid \theta_i) \,\|\, P(\xi \mid \theta_j)) + KL(P(\xi \mid \theta_j) \,\|\, P(\xi \mid \theta_i)),$$

and define the covariance function to be the negative exponential of $D$,

$$K(\theta_i, \theta_j) = \exp(-\alpha \, D(\theta_i, \theta_j)).$$

The kernel has a single scalar parameter $\alpha$ controlling its width. This is precisely what we sought: a measure of policy similarity which depends on the action selection decisions. The kernel compares behaviors, not parameters.

3.2. Estimation of the Kernel Function Values

We propose using this kernel to improve the BO algorithm discussed above, where the proposed kernel plays a role in lines 2 and 3 of the algorithm. Below we discuss using estimates of the divergence values.

Computing the exact KL-divergence requires access to a model of the decision process. Even with a model in hand, computing the integral over paths is itself a computationally demanding process. The divergence must be estimated. In this work we elect to use a simple Monte Carlo estimate of the divergence. The divergence between policies $\theta_i$ and $\theta_j$ is approximated by

$$\hat{D}(\theta_i, \theta_j) = \sum_{\xi \in \xi_i} \log\left( \frac{P(\xi \mid \theta_i)}{P(\xi \mid \theta_j)} \right) + \sum_{\xi \in \xi_j} \log\left( \frac{P(\xi \mid \theta_j)}{P(\xi \mid \theta_i)} \right),$$

using a sparse sample of trajectories generated by each policy respectively ($\xi_i$ represents the set of trajectories generated by policy $\theta_i$). Because of the definition of the trajectory density, the term within the logarithm reduces to a ratio of action selection probabilities,

$$\log\left( \frac{P(\xi \mid \theta_i)}{P(\xi \mid \theta_j)} \right) = \sum_{t=1}^{T} \log\left( \frac{P_\pi(a_t \mid \phi(s_t), \theta_i)}{P_\pi(a_t \mid \phi(s_t), \theta_j)} \right),$$

which is easily computed without a model.

A second problem arises when computing the Expected Improvement (line 3 of the BOA). Computing the conditional predictive mean and covariance for new points requires evaluation of the kernel for policies which have no trajectories associated with them. Because we have no access to a model, we elect to use an importance sampled estimate of the divergence,

$$\hat{D}(\theta_{new}, \theta_j) = \sum_{\xi \in \xi_j} \left[ \frac{P(\xi \mid \theta_{new})}{P(\xi \mid \theta_j)} \log\left( \frac{P(\xi \mid \theta_{new})}{P(\xi \mid \theta_j)} \right) + \log\left( \frac{P(\xi \mid \theta_j)}{P(\xi \mid \theta_{new})} \right) \right].$$

Though the variance of this estimate can be large, it will not negatively impact exploration in our algorithm. Our empirical results show that errors in the divergence estimates, including the importance sampled estimates, do not negatively impact performance.

4. Results

We report the performance of our algorithm on four benchmark RL tasks: mountain car, cart-pole balancing, a 3-link planar arm, and an acrobot domain. We compare BOA with our behavior based kernel to three alternatives: the BOA with a squared exponential kernel, Q-Learning with CMAC function approximation (Sutton & Barto, 1998), and LSPI (Lagoudakis et al., 2003).

4.1. Experiment Setup

We detail the special requirements necessary to implement each algorithm in this section. The results reported below are averaged over 30 runs for the BOA implementations, and 300 runs for Q-learning and LSPI. The initial policy is always randomly initialized. Expected returns reported for the first episode represent the average performance of randomly generated policies.

BOA with Behavior Based Kernel. To generate data, stochastic policies are transformed into deterministic policies by executing the maximum probability action. Single trajectories generated from these policies are provided as data to the kernel function. The expected returns reported below are for these deterministic policies. Policies are treated as stochastic for purposes of computing the kernel function. As seen below, this sparse sample is sufficient to distinguish policies using the behavior based kernel.
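The divergence estimates described above depend only on action selection probabilities, so they can be sketched in a few lines. The Python sketch below assumes a logistic policy over two actions and represents a trajectory as a list of `(phi, a)` pairs (state features and the action taken). The function names, the trajectory representation, and the default `alpha` are illustrative assumptions, not details taken from the paper.

```python
import math


def log_prob_action(theta, phi, a):
    """Log-probability of binary action a (0 or 1) under a logistic policy."""
    z = sum(t * f for t, f in zip(theta, phi))
    z = z if a == 1 else -z  # P(a=0) = sigmoid(-z)
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def log_ratio(traj, theta_a, theta_b):
    """log P(xi | theta_a) - log P(xi | theta_b); transition terms cancel."""
    return sum(log_prob_action(theta_a, phi, a) - log_prob_action(theta_b, phi, a)
               for phi, a in traj)


def divergence_estimate(trajs_i, trajs_j, theta_i, theta_j):
    """Monte Carlo estimate of D using trajectories sampled from each policy."""
    return (sum(log_ratio(xi, theta_i, theta_j) for xi in trajs_i)
            + sum(log_ratio(xi, theta_j, theta_i) for xi in trajs_j))


def divergence_estimate_new(trajs_j, theta_new, theta_j):
    """Importance sampled estimate for a policy with no trajectories of its own."""
    total = 0.0
    for xi in trajs_j:
        lr = log_ratio(xi, theta_new, theta_j)
        # The weight exp(lr) reweights samples from theta_j toward theta_new.
        total += math.exp(lr) * lr - lr
    return total


def behavior_kernel(d_hat, alpha=1.0):
    """Kernel value exp(-alpha * D) computed from a divergence estimate."""
    return math.exp(-alpha * d_hat)
```

Two policies that assign identical probabilities to the observed actions receive a divergence estimate of zero and a kernel value of one, regardless of how far apart their parameter vectors are. Each term of the importance sampled estimate has the form `(exp(lr) - 1) * lr`, which is never negative.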
Maximizing the EI is done using a gradient free black box optimizer called DIRECT (Jones et al., 1993).

BOA. We compare to a BO algorithm with a squared exponential kernel, which was the kernel of choice in past work (Lizotte et al., 2007; Wilson et al., 2010),

$$K(\theta_i, \theta_j) = \exp\left( -\tfrac{1}{2} (\theta_i - \theta_j)^T \rho \, (\theta_i - \theta_j) \right).$$

We were able to get positive results by tuning the $\rho$ vector for each experiment. Reported results are for the best setting of this parameter. The mean function of the GP was set to zero. To generate data for the BOA, stochastic policies are transformed into deterministic policies as described above. The expected returns reported below are for these deterministic policies. DIRECT is used to optimize the EI.

Q-Learning with CMAC function approximation. The basis function set was identified by hand in each problem. Epsilon greedy exploration was used.

LSPI. LSPI results are reported in the cart-pole, acrobot, and mountain car tasks. We were unable to get reasonable results from LSPI in the planar arm domain.

4.2. Cart-Pole Domain

In the cart-pole domain the agent attempts to balance a pole for a fixed allotment of time. Successful policies keep the agent's cart within a fixed boundary and maintain stability throughout the episode. In this version of the domain the agent must keep the pole balanced for 1000 steps. The state includes the location of the cart, the cart velocity, the angle of the pole, and the angular velocity of the pole. At each step the agent receives a positive reward plus a penalty for large pole angles and speeds. This reward promotes policies that minimize deviations from the ideal position. Finally, a successfully completed episode (balancing for the full 1000 steps) gives the agent a reward of 100. The policy search algorithms optimize the parameters of a linear policy.

Figure 1 shows the results for the cart-pole domain. In this case the parameter of the divergence kernel is set to 1. We have analyzed the sensitivity to this parameter, which cannot be set too small. When it is set below 0.1 the probability of convergence to the true optimum begins to fall to zero. Of course, to avoid this problem the value of the kernel parameter can always be set by maximizing the likelihood of the data (Rasmussen & Williams, 2005). In additional experiments with the kernel parameter set to 3 and 10, the algorithm continued to explore well after finding optimum points. Setting the parameter to 1 guaranteed convergence to the optimal policy for all of our runs and avoided unnecessary exploration. Clearly the divergence based kernel outperforms all of the competitors, including the BOA with the squared exponential kernel. Importantly, many policies generated by the standard BOA have similar behaviors in the cart-pole domain.
The behavior based kernel avoids exploring these redundant behaviors, resulting in quick convergence.

4.3. Planar Arm Domain

In this domain the agent controls an articulated arm, attempting to place the arm tip within a fixed goal location. Three arm joints are controlled by applying a small amount of torque (-1 or 1), which causes a kinematic response. Each arm segment is constrained to a fixed range of rotation, simulating the constraints of a real machine. At each step the agent is penalized by the distance from the center of the goal to the tip of the arm. A logistic controller is used for each joint. The state for each controller is the distance between the arm tip and the target.

Figure 2 shows the results for the planar arm domain. The generalization of the divergence-based kernel is particularly powerful in this case. Much of the policy space is quickly identified to be redundant by our kernel. In this case the performance is not as responsive to the kernel parameter. We performed experiments setting $\alpha$ as high as 10 and still observed quick convergence to the maximum value. The experiment reported here has the value of $\alpha$ set to [value missing in source].

4.4. Mountain Car Domain

The mountain car task is to accelerate a simulated car from an initial position within a basin of attraction to the peak of a slope. The problem is made difficult because the car does not have sufficient power to drive directly to the goal. The agent must generate momentum by backing up one hillside to have sufficient velocity to reach the opposite peak. At each step the mountain car agent receives a flat -1 reward, and a bonus reward of 100 if the agent reaches the peak. The agent controls the car with left and right actions. A logistic function is used to select actions.

In Figure 3 we report the results. In this case we have elected to leave the other results off of the graph.
The other methods were unable to find competitive policies with fewer than 500 samples (the standard BOA outperforms all of the other alternatives). This is due to the flat reward structure, which leads to a plateaued objective function. This makes Mountain Car a perfect experiment for illustrating the importance of directed exploration based on differences in policy behavior. Approaches based on random exploration, and on exploration weighted by returns, are poorly suited to this kind of reward structure. The divergence kernel, on the other hand, generalizes between whole regions of the plateau, directing search to policies likely to generate novel behaviors. Visual inspection of the performance of the standard BOA shows that many of the selected policies, unrelated according to the squared exponential kernel, actually produce the same action sequences when started at the initial state. This redundant search is completely avoided by using the behavior based kernel.

We also provide plots indicating the sensitivity to the kernel parameter $\alpha$. When the kernel parameter is set too low, little exploration is performed and a suboptimal 130 step policy is found. When it is set to 10, much more exploration is performed: more than 200 additional episodes are sampled before settling on an optimal 117 step policy, which compares favorably to the 119 step policy found when the parameter is set to [value missing in source].

4.5. Acrobot Task

In this domain the agent controls a simulated acrobot attached by the hands to a fixed location. The goal is to apply torque (-1, 1) to the hips of the robot and swing the feet above a pre-specified threshold. The dynamics of the acrobot are constrained so that the bottom half of the agent cannot perform full revolutions. At each step the agent receives a flat -1 penalty, and a bonus of +100 if the goal height is reached. A logistic policy is used to select actions.

Figure 4 illustrates the results for the acrobot domain. The generalization performance is less pronounced in this case. The squared exponential kernel generalizes reasonably well. The reason for this can be observed in the behavior of random policies in the acrobot domain. The behaviors are erratic. Small changes in the parameters of the policy lead to very different action sequences. Therefore, more behaviors must be searched before good policies are identified. Even so, the behavior based kernel does outperform the BOA with the squared exponential kernel.

Figure 1. Cart-pole.

5. Related Work

Kakade (2001) presented a metric based on the Fisher information to derive the natural policy gradient update. Kakade was able to show significant results in a difficult Tetris domain, outperforming standard gradient methods. Follow up work by Bagnell & Schneider (2003) proposed pursuing a related idea within the path integral framework for RL (the same framework used in this paper). Their work considers metrics defined as functions on the distribution over trajectories generated by a fixed policy, $P(\xi \mid \theta)$. In contrast to our goals, both works focus on iteratively improving a policy via gradient descent. Furthermore, no explicit attention is paid to using the metric information to guide the exploratory process. However, the insight that policy relationships should be functions of the trajectory density has played a key role in our work.

Figure 2. Planar Arm.

More closely related to our proposal is recent work in (Peters et al., 2010) and (Kober & Peters, 2010). Peters et al. (2010) use a divergence-based bound to control exploration.
Specifically, they attempt to maximize the expected reward subject to a bound proportional to the KL-divergence between the empirically observed state-action distribution and the state-action distribution of the new policy. The search for a new policy is necessarily local, restricted by the bound to be close to the current policy. By contrast, our work uses the divergence as a measure of similarity, allowing for a more aggressive search of the policy space.

A related work (Kober & Peters, 2010) derives a lower bound on the importance sampled estimate of the expected return, as was done in (Dayan & Hinton, 1997), and observes the relationship to the KL-divergence of the reward weighted behavior policy and the target policy. They derive from this relationship an EM-based update for the policy parameters. An explicit effort is made to construct the update such that exploration is accounted for. However, their method of state-dependent exploration is still based on random perturbations of the action selection policy. Our method of exploration is instead determined by the posterior uncertainty, and does not depend on the behavior policy. In fact, the mean action of a stochastic policy, treating the stochastic policy as a deterministic one, can be used by our algorithm. This is an advantage when working with real physical systems, where random perturbations can damage expensive equipment.

6. Conclusion

We have examined how to improve policy search algorithms by constructing and exploiting a probabilistic model of the expected return objective function. Our work extends BO methods for policy search problems by constructing a behavior based kernel. We motivate our kernel by examining a simple upper bound on the absolute difference of expected returns. The resulting bound is a symmetric positive measure of distance between policies, and reaches zero only when the divergence is zero. We use this upper bound as the basis for our kernel function, argue that the properties of the bound ensure a more reasonable measure of policy relatedness, and demonstrate empirically that the improved model of the objective substantially speeds exploration in some simple benchmark domains.

Figure 3. Mountain Car.

Figure 4. Acrobot.

Acknowledgments

This research is supported by the Army Research Office and the Office of Naval Research.

References

Bagnell, J. Andrew (Drew) and Schneider, Jeff. Covariant policy search. In IJCAI, August 2003.

Dayan, Peter and Hinton, Geoffrey E. Using expectation-maximization for reinforcement learning. Neural Computation, 9, February 1997.

Jones, D. R., Perttunen, C. D., and Stuckman, B. E. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl., 79(1), 1993.

Kakade, Sham. A natural policy gradient. In NIPS, 2001.

Kober, Jens and Peters, Jan. Policy search for motor primitives in robotics. Machine Learning, pp. 1-33, 2010.

Lagoudakis, Michail G., Parr, Ronald, and Bartlett, L. Least-squares policy iteration. Journal of Machine Learning Research, 4, 2003.

Mockus, J. Application of Bayesian approach to numerical methods of global and stochastic optimization. Global Optimization, 4(4), 1994.

Moreno, Pedro J., Ho, Purdy P., and Vasconcelos, Nuno. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In NIPS, 2004.

Peters, Jan, Mülling, Katharina, and Altun, Yasemin. Relative entropy policy search. In AAAI, 2010.

Pinsker, M. Information and Information Stability of Random Variables and Processes. Holden-Day Inc, San Francisco, 1964. Translated by Amiel Feinstein.

Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

Reid, Mark D. and Williamson, Robert C. Generalised Pinsker inequalities. In COLT, 2009.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Vazquez, Emmanuel and Bect, Julien. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11), 2010.

Wilson, Aaron, Fern, Alan, and Tadepalli, Prasad. Incorporating domain models into Bayesian optimization for RL. In ECML, 2010.

Lizotte, Daniel, Wang, Tao, Bowling, Michael, and Schuurmans, Dale. Automatic gait optimization with Gaussian process regression. In IJCAI, 2007.