Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method
Martin Riedmiller
Neuroinformatics Group, University of Osnabrück, Osnabrück

Abstract. This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network based Reinforcement Learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.

1 Introduction

When addressing interesting Reinforcement Learning (RL) problems in real-world applications, one sooner or later faces the problem of finding an appropriate method to represent the value function. Neural networks, in particular multi-layer perceptrons, offer an interesting perspective due to their ability to approximate nonlinear functions. Although many successful applications exist [Tes92, Lin92, Rie00], many problems have also been reported [BM95]. Many of these problems arise because the representation mechanism in a multi-layer perceptron is not local but global: a weight change induced by an update in a certain part of the state space might influence the values in arbitrary other regions - and thereby destroy the effort made so far in those regions. This typically leads to very long learning times or even to a complete failure of learning.

On the other hand, a global representation scheme can in principle have a very positive effect: by assigning similar values to related areas, it can exploit generalisation effects and therefore accelerate learning considerably. The question therefore is: how can we exploit the positive properties of a global approximation realized in a multi-layer perceptron while avoiding the negative ones? One key to this question is that we need to constrain the harmful influence that a new update can have on the value function stored in a multi-layer perceptron. The principal idea underlying our approach is simple: we have to make sure that, whenever we make an update at a new data point, we also explicitly offer the previously acquired knowledge. Here, we implement this idea by storing all previous experiences in terms of state-action transitions in memory. This data is then reused every time the neural Q-function is updated.
The algorithm proposed belongs to the family of fitted value iteration algorithms [Gor95]. They can be seen as a special form of the experience replay technique [Lin92], where value iteration is performed on all transition experiences seen so far. Recently, several algorithms have been introduced in this spirit of batch or off-line Reinforcement Learning, e.g. LSPI [LP03]. Our method is a special realisation of Fitted Q Iteration, recently proposed by Ernst et al. [EPG05]. Whereas Ernst et al. examined tree-based regression methods, we propose the use of multi-layer perceptrons with an enhanced weight update method. Our method is therefore called Neural Fitted Q Iteration (NFQ). In particular, we want to stress the following important properties of NFQ:

- The method is model-free. The only information required from the plant are transition triples of the form (state, action, successor state).
- Learning of successful policies is possible with relatively few training examples (data efficiency). This enables the learning algorithm to learn directly from real-world interactions.
- Although requiring much less knowledge about the plant than analytical controllers, the method is able to find control policies that compare well to analytically designed controllers (see the cart-pole regulator benchmark).

2 Main Idea

2.1 Markovian Decision Processes

The control problems considered in this paper can be described as Markovian Decision Processes (MDPs). An MDP is described by a set S of states, a set A of actions, a stochastic transition function p(s, a, s') describing the (stochastic) system behavior, and an immediate reward or cost function c : S × A → R. The goal is to find an optimal policy π : S → A that minimizes the expected cumulated costs for each state. In particular, we allow S to be continuous, assume A to be finite for our learning system, and p to be unknown to our learning system (model-free approach). Decisions are taken in regular time steps with a constant cycle time.

2.2 Classical Q-Learning

In classical Q-learning, the update rule is given by

  Q_{k+1}(s, a) := (1 - α) Q_k(s, a) + α (c(s, a) + γ min_b Q_k(s', b)),

where s denotes the state where the transition starts, a is the action that is applied, and s' is the resulting state. α is a learning rate that has to be decreased in the course of learning in order to fulfill the conditions of stochastic approximation, and γ is a discounting factor (see e.g. [SB98]). It can be shown that, under mild assumptions, Q-learning converges for finite state and action spaces, as long as every state-action pair is updated infinitely often. Then, in the limit, the optimal Q-function is reached. Typically, the update is performed on-line in a sample-by-sample manner, that is, every time a new transition is made, the value function is updated.
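For concreteness, a minimal sketch of this tabular update for a cost-minimisation MDP follows; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def q_learning_update(Q, s, a, s_next, cost, alpha=0.1, gamma=0.95):
    """One classical Q-learning update.

    Q is a table of shape (n_states, n_actions); s, a, s_next are indices.
    The target is the immediate cost plus the discounted minimum Q-value
    of the successor state, exactly as in the update rule above.
    """
    target = cost + gamma * np.min(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

# Example: a toy 3-state, 2-action table updated after one observed transition.
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, s_next=2, cost=0.01)
```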
2.3 Q-Learning for Neural Networks

In principle, the above Q-learning rule can be directly implemented in a neural network. Since no direct assignment of Q-values as in a table-based representation can be made, an error function is introduced instead, which aims to measure the difference between the current Q-value and the new value that should be assigned. For example, a squared-error measure like the following can be used:

  error = (Q(s, a) - (c(s, a) + γ min_b Q(s', b)))^2.

At this point, common gradient descent techniques (like the backpropagation learning rule) can be applied to adjust the weights of a neural network in order to minimize the error. As above, this update rule is typically applied after each new sample. The problem with this on-line update rule is that typically several tens of thousands of episodes have to be performed until an optimal or near-optimal policy has been found [Rie00]. One reason for this is that if the weights are adjusted for one certain state-action pair, then unpredictable changes also occur at other places in the state-action space. Although in principle this could also have a positive effect (generalisation), in many cases, in our experience, this seems to be the main reason for unreliable and slow learning.

3 Neural Fitted Q Iteration (NFQ)

3.1 Basic Idea

The basic idea underlying NFQ is the following: instead of updating the neural value function on-line (which leads to the problems described in the previous section), the update is performed off-line, considering an entire set of transition experiences. Experiences are collected in triples of the form (s, a, s') by interacting with the (real or simulated) system. [1] Here, s is the original state, a is the chosen action and s' is the resulting state. The set of experiences is called the sample set D.

  [1] Note that experiences are often collected in four-tuples with an additional entry denoting the immediate costs or reward from the environment. Since we take an engineering view of the learning problem, we think of the immediate costs as something specified by the designer of the learning system rather than something that occurs naturally in the environment and can only be observed. Therefore, costs come in at a later point and can also potentially be changed without collecting further experiences. However, the basic working of the algorithm is not affected by this.

The consideration of the entire training information, instead of on-line samples, has an important further consequence: it allows the application of advanced supervised learning methods that converge faster and more reliably than on-line gradient descent methods. Here we use Rprop [RB93], a supervised learning method for batch learning, which is known to be very fast and very insensitive to the choice of its learning parameters. The latter fact has the advantage that we do not have to care about tuning the parameters for the supervised learning part of the overall (RL) learning problem.
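Before turning to the full algorithm, here is a minimal sketch of how the squared Bellman error of section 2.3 can be evaluated for a single transition; on-line neural Q-learning takes a gradient step on this quantity after every sample, whereas NFQ (next section) minimizes it jointly over the whole sample set D. All names below are illustrative, not from the paper.

```python
import numpy as np

def td_squared_error(q_func, s, a, s_next, cost, actions, gamma=0.95):
    """Squared Bellman error for a single transition (s, a, s').

    q_func(state, action) can be any approximator, e.g. the forward pass of a
    multi-layer perceptron; `actions` is the finite set of admissible actions.
    """
    target = cost + gamma * min(q_func(s_next, b) for b in actions)
    return (q_func(s, a) - target) ** 2

# Illustrative use with a dummy Q-function (not a trained network).
dummy_q = lambda s, a: float(np.dot(s, s) + 0.1 * abs(a))
err = td_squared_error(dummy_q, s=np.array([0.1, 0.0]), a=1,
                       s_next=np.array([0.12, 0.05]), cost=0.01, actions=[0, 1])
```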
3.2 The NFQ Algorithm

NFQ is an instance of the Fitted Q Iteration family of algorithms [EPG05], where the regression algorithm is realized by a multi-layer perceptron. The algorithm is displayed in figure 1. It consists of two major steps: the generation of the training set P and the training on these patterns within a multi-layer perceptron. The input part of each training pattern consists of the state s^l and action u^l of training experience l. The target value is computed as the sum of the transition costs c(s^l, u^l, s'^l) and the expected minimal path costs for the successor state s'^l, computed on the basis of the current estimate of the Q-function, Q_k.

    NFQ_main() {
      input:  a set of transition samples D
      output: Q-value function Q_N
      k = 0
      init_MLP() -> Q_0
      do {
        generate pattern set P = {(input^l, target^l), l = 1, ..., #D} where:
          input^l  = (s^l, u^l)
          target^l = c(s^l, u^l, s'^l) + γ min_b Q_k(s'^l, b)
        Rprop_training(P) -> Q_{k+1}
        k := k + 1
      } while (k < N)
    }

Fig. 1. Main loop of NFQ

Since at this point training the Q-function can be done as batch learning on a fixed pattern set, we can use more advanced supervised learning techniques that converge more quickly and more reliably than ordinary gradient descent techniques. In our implementation, we use the Rprop algorithm for fast supervised learning [RB93]. The training of the pattern set is repeated for several epochs (= complete sweeps through the pattern set), until the pattern set is learned successfully.
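A compact sketch of this loop in Python is given below, using scikit-learn's MLPRegressor as a stand-in for the Rprop-trained multi-layer perceptron of the paper; the environment interface, the cost function and all names are illustrative assumptions, not the author's implementation.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq(transitions, actions, n_iterations=20, gamma=0.95,
        cost_fn=lambda s, u, s2: 0.01):
    """Batch fitted Q iteration with an MLP regressor (NFQ-style sketch).

    transitions: list of (s, u, s') triples, s and s' as 1-D numpy arrays,
                 u a scalar action from `actions`.
    cost_fn:     immediate cost c(s, u, s'), supplied by the designer.
    """
    def encode(s, u):
        return np.concatenate([s, [u]])

    q_net = None  # before the first fit, targets fall back to immediate costs

    def q_values(s):
        if q_net is None:
            return np.zeros(len(actions))
        X = np.array([encode(s, b) for b in actions])
        return q_net.predict(X)

    for _ in range(n_iterations):
        X, y = [], []
        for (s, u, s2) in transitions:                 # full sample set D
            target = cost_fn(s, u, s2) + gamma * np.min(q_values(s2))
            X.append(encode(s, u))
            y.append(target)
        # Batch supervised step; stands in for Rprop training of the MLP.
        q_net = MLPRegressor(hidden_layer_sizes=(5, 5), activation='logistic',
                             solver='lbfgs', max_iter=500)
        q_net.fit(np.array(X), np.array(y))
    return q_net
```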
3.3 Sample Setting of Costs

Here we give an example setting of the immediate cost structure, which can be used in many typical reinforcement learning settings. We find it useful to use a more or less standardized procedure to set up the learning problem, but we want to stress that NFQ is by no means tailored to this type of cost function, but works with arbitrary cost structures. In the following, the set of goal states is denoted by S+ and the set of forbidden states by S-. S+ therefore denotes the region to which the system should finally be controlled (and, in case of a regulator problem, in which it should be kept), and S- denotes regions in state space that must be avoided by a correct control policy. Within this setting, the generation of training patterns is modified as follows:

  target^l = c(s^l, u^l, s'^l),                          if s'^l ∈ S+
  target^l = C,                                          if s'^l ∈ S-                      (1)
  target^l = c(s^l, u^l, s'^l) + γ min_b Q_k(s'^l, b),   else (standard case)

Setting c(s^l, u^l, s'^l) to a positive constant value c_trans means aiming for a minimum-time controller. In technical process control this is often desirable, and therefore we choose this setting in the following. C is set to 1.0, since this is the maximum output value of the multi-layer perceptron that we use. In regulator problems (see section 4), reaching a goal state does not terminate the episode. Therefore, the first line of the above equation must not be applied. Instead, only lines 2 and 3 are executed, with c(s^l, u^l, s'^l) = 0 if s'^l ∈ S+ and c(s^l, u^l, s'^l) = c_trans otherwise. Note that due to its purity, this setting is widely applicable, and no prior knowledge about the environment (like, for example, the distance to the goal) is incorporated.

3.4 Variants

Several variants can be applied to the basic algorithm. In particular, for the experiments in sections 5.2 and 5.3 we used a version where we incrementally add transitions to the experience set. This is especially useful in situations where a reasonable set of experiences cannot be collected by controlling the system with purely random actions. Instead, training samples are collected by greedily exploiting the current Q_k function and are added to the sample set D. Another heuristic that we found helpful is to add artificial training patterns from the goal region, which have a known target value of 0. This technique clamps the neural value function to zero in the goal region, and we therefore call it the hint-to-goal heuristic. Note that no additional prior knowledge is required to generate these patterns, since the goal region is already known in the task specification.
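A sketch of the pattern-target computation of equation (1), together with the hint-to-goal heuristic, is given below; the region-membership predicates, constants and function names are illustrative assumptions in the spirit of the text, not the author's code.

```python
GAMMA, C_TRANS, C_MAX = 0.95, 0.01, 1.0   # c_trans > 0 yields a minimum-time controller

def make_target(s, u, s2, q_min, in_goal, in_forbidden, regulator=False):
    """Target value for one transition (s, u, s') following equation (1).

    q_min(s') returns min_b Q_k(s', b) under the current estimate Q_k;
    in_goal / in_forbidden test membership in S+ / S-.
    """
    if in_forbidden(s2):
        return C_MAX                       # forbidden successor: maximum cost C
    if not regulator and in_goal(s2):
        return C_TRANS                     # goal reached, episode ends: pure transition cost
    # Standard (bootstrapped) case; in the regulator setting the transition
    # cost is zero inside S+ and c_trans outside.
    cost = 0.0 if (regulator and in_goal(s2)) else C_TRANS
    return cost + GAMMA * q_min(s2)

def hint_to_goal_patterns(goal_states, actions, encode):
    """Artificial patterns clamping the Q-function to zero inside the goal region."""
    return [(encode(s, u), 0.0) for s in goal_states for u in actions]
```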
4 Benchmarking

The following gives a short overview of the intention of the benchmarks performed in the empirical section.

4.1 Types of Tasks

In control problems, three basic types of task specification can be distinguished (there might be more, but for our purposes this categorisation is sufficient):

- Avoidance control task: keep the system somewhere within the valid region of state space. Pole balancing is typically defined as such a problem, where the task is to avoid that the pole falls or the cart hits the boundary of the track.
- Reaching a goal: the system has to reach a certain area in state space. As soon as it gets there, the task is immediately finished. Mountain car is typically defined as getting the car to a certain position up the hill.
- Regulator problem: the system has to reach a certain region in state space and has to be actively kept there by the controller. This corresponds to the problems typically tackled with methods of classical control theory.

The problem types show different levels of difficulty, even when the underlying plant to be controlled is the same. In the following, we consider three benchmark problems, each belonging to one of the above categories.

4.2 Evaluating Learning Performance

Each learning experiment consists of a number of episodes. An episode is a sequence of control cycles that starts with an initial state and ends if the current state fulfills some termination condition (e.g. the system reached its goal state or a failure occurred) or some maximum number of cycles has been reached. Learning time can in principle be measured in many different ways: number of episodes needed, number of cycles needed, number of updates performed, absolute computation time, etc. Since we are interested in methods that can directly learn on real systems, our preferred measure of learning effort is the number of cycles needed to achieve a certain performance. This number is directly related to the amount of interaction with the plant to be controlled. By multiplying the number of cycles with the length of the control interval, we get the absolute real time that we would have to spend on a real system to achieve a certain performance. We also give the number of episodes needed to learn a task. Although this is not as expressive as the number of cycles (since this figure depends drastically on the maximum allowed length of a training episode), it is a commonly used measure and gives at least a rough intuition about the learning effort.

4.3 Evaluating Controller Performance

Controller performance is evaluated with respect to some cost measure that evaluates the average performance over a certain number of control episodes. In principle, this cost measure can be chosen arbitrarily. Due to its practical relevance, we use the average time to the goal as a performance measure for the controller. In the regulator problem case, we measure the overall time outside the target region. This takes into account the fact that a controlled system might leave the target region again. Note that the learning controller might have an internal goal formulation that differs from the performance measure (e.g. by using discounting or shaping rewards).
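As an illustration of the cost measures just described, the following small rollout-based evaluation sketch counts the cycles to the goal or, for regulator tasks, the cycles spent outside the target region; the environment interface is hypothetical, not the paper's CLS2 API.

```python
def evaluate_controller(policy, env, start_states, max_cycles, regulator=False):
    """Average time to goal, or time outside the target region (regulator tasks).

    policy(s) returns an action; env.reset(s0), env.step(a) and env.in_goal(s)
    are assumed methods of a hypothetical plant simulator.
    """
    total_cost = 0
    for s0 in start_states:
        s = env.reset(s0)
        cost = 0
        for _ in range(max_cycles):
            if not regulator and env.in_goal(s):
                break                      # time-to-goal task: episode ends at the goal
            if not env.in_goal(s):
                cost += 1                  # one more cycle not (yet) inside the target region
            s = env.step(policy(s))
        total_cost += cost
    return total_cost / len(start_states)
```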
Another important aspect when evaluating controller performance is the specification of the working region of the controller, that is, the set of starting states for which the controller should work. We distinguish between the following types of working regions:

- always start from a single starting state
- start from one of a finite set of starting states
- start from an arbitrary random state within a starting region

In the following experiments, we use the third case, which is the most general and (typically) the most challenging one.

5 Empirical Results

All experiments are done using CLS2 (Closed Loop System Simulator, available at clss.sf.net), a software system designed to benchmark (not only) RL controllers on a wide variety of plants.

5.1 The Pole Balancing Task

The task is to balance a pole in the upright position by applying appropriate forces to the system. System equations and parameters are the same as in [LP03]. Three actions are available: left force (-50 N), right force (+50 N) and no force. Uniform noise in [-10, 10] is added to the system. The cycle length is 0.1 s. The state space is continuous and consists of the angle and the angular velocity. An episode was counted as a failure if the absolute value of the pole angle exceeded π/2.

Learning System Setup. For comparison, we choose the same cost structure as in [LP03]: immediate costs of 0 arise if the angle remains within [-π/2, π/2] (= S+); if the angle gets outside this region, the episode is stopped and costs of +1 are given. A discount factor of γ = 0.95 is used. Transition samples were generated by starting the pole in an upright position and then applying random control signals until failure. The average length of a training episode was about 6 cycles. NFQ uses a multi-layer perceptron with 3 inputs (2 for the state, 1 for the action), two hidden layers with 5 neurons each, and 1 output. For all neurons, sigmoidal activation functions with outputs between 0 and 1 were used.
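A sketch of how the random training episodes for this task could be collected is shown below; `pole_step` is a hypothetical stand-in for the simulated plant dynamics of [LP03], and the episode cap is an assumption, not a parameter from the paper.

```python
import math
import random

ACTIONS = [-50.0, 0.0, 50.0]          # left force, no force, right force (N)

def collect_random_episodes(pole_step, n_episodes, max_len=100):
    """Generate (s, a, s') transition triples by applying random forces to the
    pole, starting upright, until the pole leaves [-pi/2, pi/2] or max_len is hit.

    pole_step(state, force) -> next_state is an assumed simulator of the
    system equations; state = (angle, angular velocity), noise included.
    """
    D = []
    for _ in range(n_episodes):
        s = (0.0, 0.0)                              # upright, at rest
        for _ in range(max_len):
            a = random.choice(ACTIONS)
            s_next = pole_step(s, a)
            D.append((s, a, s_next))
            if abs(s_next[0]) > math.pi / 2:        # failure: pole fell over
                break
            s = s_next
    return D
```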
Results. Lagoudakis and Parr reported very good results both for their LSPI approach and for Q-learning with experience replay using a linear function approximator with reasonably selected basis functions [LP03]. The learned controllers were tested on 1000 test episodes with a maximum length of 300 seconds each. LSPI reached an average balancing time of 285 seconds after 1000 training episodes. This means that most, but not all, of the training trials generated totally successful policies. For Q-learning with experience replay they report a balancing time of about 300 seconds after 750 episodes of training [LP03].

Table 1. Results of NFQ on the pole balancing benchmark. The left column reports the number of random episodes that were used for training. The length of each episode was about 6 cycles. Altogether, 50 repetitions of the experiment were done. For each experiment, a new set of random episodes was produced. Using 200 or more training episodes (about 1200 cycles, corresponding to 2 minutes of real time), all experiments generated successful policies, i.e. the controller balanced the pole for all the test cases for the maximum time of 300 s.

  # random episodes    successful learning trials
  50                   23/50 (46 %)
  ...                  44/50 (88 %)
  ...                  48/50 (96 %)
  200                  50/50 (100 %)
  ...                  50/50 (100 %)
  ...                  50/50 (100 %)

Results of the NFQ method are shown in table 1. The experiments were repeated 50 times. Each experiment had a different set of training samples and a different initialisation of the neural network weights. With only 50 training episodes (corresponding to about 300 transition samples), NFQ was able to find totally successful policies (policies that balanced the pole for the full 300 seconds in all test episodes) in 23 out of 50 experiments. Using more training episodes, the result improves. Using only 200 training episodes, a successful policy could be found reliably in all of the 50 experiments. This is a remarkable result with respect to training data efficiency and gives some hint of the benefit of the generalisation ability of a multi-layer perceptron.

5.2 The Mountain Car Benchmark

The mountain car benchmark is about accelerating a car up to the top of a hill, where in many situations the acceleration of the car is too weak to drive directly to the top; instead, the car has to move in the other direction first to gain enough energy [SB98]. The control interval is Δt = 0.05 s. Actions are restricted to the interval [-4, 4]. The road ends at -1 m, i.e. the position must fulfill the constraint position > -1 m. The task is to reach the top, which means that the position must become greater than or equal to 0.7 m. For testing performance, 1000 starting states are drawn randomly from the interval (-1, 0.7). The initial velocity of the car is set to 0. Performance is measured by the average number of cycles to the goal.

Learning System Setup. Two actions are provided to the learning controller, -4 and +4. For training, initial starting positions are drawn randomly from (-1, 0.7); the initial velocity of the car was always set to zero. Training trajectories had a maximum length of 50 cycles. An episode was stopped if the system entered S- (failure by constraint violation) or entered S+ (success). Each training trajectory was generated by a controller that greedily exploited the current Q-value function.
The Q-value function was represented by a multi-layer perceptron with 3 input neurons (2 state variables and 1 action), 2 hidden layers of 5 neurons each, and 1 output neuron, all equipped with sigmoidal activation functions. The weights of the network were randomly initialized within [-0.5, 0.5]. After each episode, one iteration of the inner NFQ loop was performed. The hint-to-goal heuristic was used with a factor of 100. For each transition, costs of c_trans = 0.01 were given.
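The episode-wise variant used here (greedy exploitation of the current Q-function, a growing sample set, and one NFQ iteration per episode, as described in section 3.4) could be organized roughly as follows; the environment interface, the `fit_q` training step and the state encoding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def greedy_action(q_net, s, actions, encode):
    """Pick the action with minimal predicted cost under the current Q-function."""
    X = np.array([encode(s, a) for a in actions])
    return actions[int(np.argmin(q_net.predict(X)))]

def incremental_nfq(env, q_net, actions, encode, fit_q, n_episodes=500, max_len=50):
    """Collect each episode greedily with the current Q-function, append its
    transitions to the sample set D, then run one fitted-Q iteration over D.

    env.reset(), env.step(a), env.episode_over(s) and fit_q(q_net, D) are assumed
    helpers; q_net is assumed to be already initialised (e.g. fitted once on
    hint-to-goal patterns) so that q_net.predict can be called.
    """
    D = []
    for _ in range(n_episodes):
        s = env.reset()                       # random start state for this episode
        for _ in range(max_len):
            a = greedy_action(q_net, s, actions, encode)
            s_next = env.step(a)
            D.append((s, a, s_next))
            if env.episode_over(s_next):      # goal reached or forbidden region entered
                break
            s = s_next
        q_net = fit_q(q_net, D)               # one inner NFQ loop over the grown set D
    return q_net
```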
Results. Results of the NFQ approach on the mountain car benchmark are shown in table 2. The results are averaged over 20 experiments. Experiments differ in the randomly drawn starting states for training and the randomly initialized neural Q-function. Each trial was stopped after 500 training episodes. All 20 experiments produced a successful policy, i.e. a policy that was able to reach the goal state from all of the 1000 randomly drawn starting positions. To generate a successful policy, only about 71 episodes or 2777 cycles were needed on average over all experiments. This corresponds to less than two and a half minutes of training in real time. In the best case, a successful policy could be found in only 356 cycles; even in the worst case, the number of cycles needed corresponds to about eight and a half minutes in real time and is therefore still a very realistic figure for an assumed interaction with a real system. Finding a fast policy to the goal can be done in about 296 episodes, corresponding to about 9 minutes in real time. Again, this is a very reasonable number for direct interaction with a real system.

Table 2. Results of NFQ on the mountain car benchmark. The upper part reports the training effort needed to reach a successful policy. A policy is successful if all test situations are controlled to the goal state. The table shows the figures for the average (best/worst) number of episodes, the average (best/worst) number of cycles and the corresponding time for interacting with a real system. The lower part reports the learning effort needed to reach an optimized policy. Averaged over all training trials, the average best costs are 28.7. This value is slightly better than the performance achieved with a finely granulated Q-table (29.0).

  Mountain Car
  First successful policy
              episodes   cycles   interaction time   costs
    average   ~71        2777     2m19s              ...
    best      ...        356      ...                ...
    worst     ...        ...      ...                ...
  Best policy found (within 500 episodes)
              episodes   cycles   interaction time   costs
    average   ~296       ...      ...m06s            28.7
    best      ...        ...      ...                ...
    worst     ...        ...      ...                ...

The best policies found needed only an average of 28.7 cycles to reach the goal. This figure compares well to a table-based Q-learning approach, which yielded an average of 29.0 cycles to reach the goal. This means that we can expect the NFQ controllers to be pretty close to the optimum. As a side remark (not meant as a true comparison): to get to this result, table-based Q-learning required 300,000 episodes (with a maximum length of 300 cycles), and the Q-table had a resolution of ... entries.

5.3 The Cartpole Regulator Benchmark

The system dynamics of the cartpole system are described in [SB98]. The control interval is Δt = 0.02 s. Actions are restricted to the interval [-10, 10]. The position is restricted by the constraint -2.4 ≤ pos ≤ 2.4. For testing performance, 1000 starting states are drawn randomly. Results on the cartpole system are typically reported with respect to the maximum balancing time. Here, we report results on a more difficult task that comprises balancing, namely cartpole regulation. The task is to move the cart to a certain position and keep it there while preventing the pole from falling. The target position of the cart is the middle of the track, with a tolerance of ±0.05 m. As a further complication, we allow initial starting states that deviate a lot from the all-zero position: for testing performance, initial pole angles are randomly drawn from [-0.3, 0.3] (in rad), positions are drawn from [-1, 1] (in m), and initial velocities are set to 0. This more complicated formulation of the cartpole benchmark is closer to realistic control tasks, and the resulting controllers can be compared to control policies derived by classical controller design methods.

Learning System Setup. Two actions are available to the learning controller, -10 N and +10 N. For training, initial starting positions of the cart are drawn randomly from [-2.3, 2.3], initial pole angles are drawn from [-0.3, 0.3] (in rad), and cart velocity and angular velocity are initially set to zero. Training episodes had a maximum length of 100 cycles. Each training episode was generated by a controller that greedily exploited the current Q-value function. The Q-value function was represented by a multi-layer perceptron with 5 inputs, 2 hidden layers with 5 neurons each, and one output neuron, all equipped with sigmoidal activation functions. The weights of the network were randomly initialized within [-0.5, 0.5]. After each episode, one loop of the NFQ algorithm was performed. The hint-to-goal heuristic was used with a factor of 100. For each transition, costs of c_trans = 0.01 were given.

Results. Results for the cart-pole benchmark are shown in table 3. Performance is tested on 1000 test episodes starting from randomly drawn initial states and having a maximum length of 3000 cycles. In the cartpole regulator benchmark, a controller is successful if, at the end of the episode, the pole is still upright and the cart is at its target position 0 within a ±0.05 m tolerance. Note that all the controllers that solve the regulator problem also solve the balancing problem. Typically, the balancing problem is solved much earlier than the regulator problem (figures not shown here).
Table 3. Results of NFQ on the cart-pole regulator benchmark. Training time was restricted to 500 episodes per trial. For an interpretation of the figures, see the explanation at the table for the mountain car benchmark.

  Cart Pole Regulator
  First successful policy
              episodes   cycles   interaction time   costs
    average   ...        ...      ...m49s            ...
    best      ...        ...      ...                ...
    worst     ...        ...      ...                ...
  Best policy found (within 500 episodes)
              episodes   cycles   interaction time   costs
    average   ...        ...      ...m36s            ...
    best      ...        ...      ...                ...
    worst     ...        ...      ...                ...

Again, training is done very efficiently. Although the control problem is challenging, a moderate amount of sample transitions is sufficient to find a successful policy and, with some more training, the best controller. This corresponds to an average real time of 5 minutes (or 10 minutes, respectively, for the best controller) that would be needed to collect the transition samples on a corresponding real system.

To get a better feeling for the control performance of the learned controllers, we analytically designed a linear controller for the cartpole regulator benchmark. We used a pole assignment method, placing the poles of the closed-loop system such that it was stable. Additionally, we tried to find parameters that produced control actions within the interval [-10, +10] according to the above specification. The control law used was u = Rx, where R = (30.61, 7.77, 0.45, 1.72) and x is the state vector. For the linear controller, the average number of cycles outside the goal region over the 1000 test starting positions was ... . The neural controllers that were learned had an average cost of 132.9, which means that they are about 3 times as fast as the linear controller. This is an even more remarkable result if one considers that no prior knowledge about the plant behaviour was available to develop the neural policy.

6 Conclusion

The paper proposes NFQ, a memory-based method to train Q-value functions represented by multi-layer perceptrons. By storing and reusing all transition experiences, the neural learning process can be made very data efficient and reliable. Additionally, by allowing for batch supervised learning in the core of the adaptation, advanced supervised learning techniques can be applied that provide reliable and quick convergence of the supervised learning part of the problem. NFQ makes it possible to exploit the positive effects of generalisation in multi-layer perceptrons while avoiding their negative effect of disturbing previously learned experiences.
The exploitation of generalisation leads to highly data efficient learning. This is shown in the three benchmarks performed. The amount of training experience required for learning successful policies is considerably low. The corresponding time for acquiring the training data on a hypothetical real plant lies in the range of a few minutes for all three benchmarks performed.

For all three benchmarks, the same neural network structure was successfully used. Of course, this does not mean that we have found the one neural network that solves all control problems, but it is a positive hint regarding the robustness of NFQ with respect to the choice of the underlying neural network. Robustness against the parametrisation of a method is of special importance for practical applications, since the search for sensitive parameters can be a resource-consuming issue.

References

[BM95] J. Boyan and A. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann, 1995.
[EPG05] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005.
[Gor95] G. J. Gordon. Stable function approximation in dynamic programming. In A. Prieditis and S. Russell, editors, Proceedings of the ICML, San Francisco, CA, 1995.
[Lin92] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 1992.
[LP03] M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4, 2003.
[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In H. Ruspini, editor, Proceedings of the IEEE International Conference on Neural Networks (ICNN), San Francisco, 1993.
[Rie00] M. Riedmiller. Concepts and facilities of a neural reinforcement learning control architecture for technical process control. Journal of Neural Computing and Application, 8, 2000.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.
[Tes92] G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8, 1992.