Reinforcement Learning
LU 2 - Markov Decision Problems and Dynamic Programming

Dr. Martin Lauer
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Albert-Ludwigs-Universität Freiburg
martin.lauer@kit.edu

Prof. Dr. Martin Riedmiller, Dr. Martin Lauer, Machine Learning Lab, University of Freiburg, Reinforcement Learning (1)
LU 2: Markov Decision Problems and DP

Goals:
- definition of Markov Decision Problems (MDPs)
- introduction to Dynamic Programming (DP)

Outline:
- short review
- definition of MDPs
- DP: principle of optimality
- the DP algorithm (backward DP)
Review

- process, can be influenced by actions
- agent: sensory input, output of actions
- feedback RL: training information through evaluation only
- delayed reinforcement learning: decision, decision, decision, ... evaluation
- multi-stage decision process
- optimization
The Agent Concept
Multi-stage decision problems
Three components

- system, process
- rewards, costs
- policy, strategy
Requirements for the model

Goal: describing the system's behaviour (also called a system: process, world, environment)

Requirements for a model:
- situations
- activities: the current situation can be influenced
- adjustments possible at discrete points in time
- noise, interference, randomness
- goal specification: definition of costs / rewards
System description

- discrete decision points t ∈ T = {0, 1, ..., N} or (stages) T = {0, 1, ...}
- system state (situation) s_t ∈ S; here: S finite
- actions u_t ∈ U; here: U finite
- transition function s_{t+1} = f(s_t, u_t): the reaction of the system
Goal formulation: introducing costs

At every decision (= in every stage) direct costs arise.

Direct costs: c : S → R
Refinement, dependent on state and action: c : S × U → R

Reward, cost, punishment?
Summary: deterministic systems

- discrete decision points t ∈ T = {0, 1, ..., N} or stages T = {0, 1, ...}
- system state (situation) s_t ∈ S
- actions u_t ∈ U
- transition function s_{t+1} = f(s_t, u_t)
- direct costs c : S × U → R

5-tuple (T, S, U, f, c)
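The 5-tuple above can be sketched in code. A minimal, illustrative example (the chain dynamics, cost values, and all names are made up, not from the lecture):

```python
# A deterministic system (T, S, U, f, c): a 3-state chain where action
# "right" moves one state forward and "stay" does nothing.

S = [0, 1, 2]            # finite state set
U = ["stay", "right"]    # finite action set
N = 2                    # horizon: decision points t = 0, ..., N

def f(s, u):
    """Transition function s_{t+1} = f(s_t, u_t)."""
    return min(s + 1, 2) if u == "right" else s

def c(s, u):
    """Direct costs c : S x U -> R (moving costs 1, staying is free)."""
    return 1.0 if u == "right" else 0.0

# Roll out a fixed action sequence and accumulate the total costs:
s, total = 0, 0.0
for u in ["right", "right"]:
    total += c(s, u)
    s = f(s, u)
# s is now 2, total costs 2.0
```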
Example: shortest path problems

Find the shortest path from the start node to the finish node. Every edge has a specific cost that can be interpreted as length.

- optimization goal over multiple stages
- evaluation of the whole sequence (reminder: decision, decision, ... evaluation)
- look at the accumulated total costs: Σ_{t ∈ T} c(s_t, u_t)
Stochastic systems

Again, requirements for a model:
- situations
- activities: the current situation can be influenced
- adjustments possible at discrete points in time
- noise, interference, randomness
- goal specification: definition of costs / rewards
Markov Decision Processes

Deterministic system: 5-tuple (T, S, U, f, c)

Stochastic system: the deterministic transition function f is replaced by a conditional probability distribution. In the following, we are looking at a finite state set S = {1, 2, ..., n}. Let i, j ∈ S be states.

Notation: P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u)

Markov Decision Process (MDP): 5-tuple (T, S, U, p_ij(u), c(s, u))
Markov property

It holds that: P(s_{t+1} = j | s_t, u_t) = P(s_{t+1} = j | s_t, s_{t-1}, ..., u_t, u_{t-1}, ...)

The probability distribution of the following state s_{t+1} is uniquely defined given the knowledge of the current state s_t and the action u_t. In particular, it does not depend on the previous history of the system.
Remarks (1)

A deterministic system is a special case of an MDP:

P(s_{t+1} | s_t, u_t) = 1 if s_{t+1} = f(s_t, u_t), and 0 otherwise
Remarks (2)

Equivalent description with a deterministic transition function f:

Approach: an additional argument - a random variable w_t (noise): s_{t+1} = f(s_t, u_t, w_t), with w_t a random variable with given probability distribution P(w_t | s_t, u_t).

Transformation into the previous form: let W(i, u, j) = {w | j = f(i, u, w)} be the set of all values of w for which the system transitions from state i, on input of u, into state j. Then it holds: p_ij(u) = P(w ∈ W(i, u, j))
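The transformation p_ij(u) = P(w ∈ W(i, u, j)) can be sketched directly: sum the probability mass of all noise values that map i to j. The toy dynamics and noise distribution below are invented for illustration:

```python
# Noise w takes values 0 or 1 with the given probabilities P(w):
P_w = {0: 0.7, 1: 0.3}

def f(i, u, w):
    """Deterministic transition with a noise argument (toy dynamics on {0,1,2})."""
    return max(0, min(2, i + u - w))

def p(i, j, u):
    """p_ij(u) = P(w in W(i,u,j)): total mass of all w with f(i,u,w) = j."""
    return sum(prob for w, prob in P_w.items() if f(i, u, w) == j)

# Rows of p must sum to one over all successor states j:
assert abs(sum(p(1, j, 1) for j in range(3)) - 1.0) < 1e-12
```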
Summary: MDPs

- discrete decision points t ∈ T = {0, 1, ..., N} or stages T = {0, 1, ...}
- system state (situation) s_t ∈ S
- actions u_t ∈ U
- transition probabilities p_ij(u): P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u)
  alternatively: transition function s_{t+1} = f(s_t, u_t, w_t) with w_t a random variable
- direct costs c : S × U → R

5-tuple (T, S, U, p_ij(u), c(s, u))
Summary: MDPs

- model: state, action, following state
- deterministic and stochastic transition function
- information about the history is summarized in the state
- very general description: OR, control engineering, games, ...

Generalizations (not covered here):
- transition function not stationary: p_ij,t(u)
- costs not stationary: c_t(i, u)
Example: stock keeping

Assume you are the owner of a toy shop at an exhibition. The exhibition lasts N days.

- state s_t: number of toys in your shop
- action u_t: ordered number of toys to be delivered on the next day
- disturbance w_t: number of toys sold

System equation: s_{t+1} = s_t + u_t - w_t

Costs: c(s, u) = c_1(s) + c_2(u) - gain
- c_1(s): costs for toys in stock
- c_2(u): acquisition costs for each toy that was ordered
- minus the gain for sold toys

There are also terminal costs g(s) if there are still toys in stock after the N days.
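The stock-keeping example can be sketched as a short simulation. The cost constants, the gain per toy, and the demand distribution below are made-up illustration values, not from the lecture:

```python
import random

def step(s, u, w):
    """System equation s_{t+1} = s_t + u_t - w_t; sales are capped at stock."""
    w = min(w, s)                  # cannot sell more toys than are in stock
    return s + u - w, w

def cost(s, u, sold, hold=0.1, order=1.0, gain=2.0):
    """c(s,u) = c_1(s) + c_2(u) - gain: holding + ordering costs minus revenue."""
    return hold * s + order * u - gain * sold

random.seed(0)
s, total = 5, 0.0
for t in range(3):                 # three exhibition days
    u = 2                          # fixed order quantity, for illustration
    w = random.randint(0, 4)       # random demand
    s_next, sold = step(s, u, w)
    total += cost(s, u, sold)
    s = s_next
```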
Policy and selection function

Policy: the selection function π_t : S → U, π_t(s) = u, chooses at time t an action u ∈ U as a function of the current state s ∈ S.

- the selection function chooses an action depending on the situation (see the agent graphic)
- refinement: π_t : S → U, π_t(s) = u with u ∈ U(s), a situation-dependent action set (example: chess)

A policy π̂ consists of N selection functions (N being the number of decision points): π̂ = (π_0, π_1, ..., π_t, ...)
Non-stationary policies

The selection function π_t can depend on the time of the decision. Meaning: the same situation at different points in time can lead to different decisions of the agent.

π̂ = (π_0, π_1, ..., π_t, ...)

If the selection functions differ for single time points, we call it a non-stationary policy.

Example soccer: situation s: the midfield player has the ball.
- reasonable action in the first minute: π_1(s) = return pass
- reasonable action in the last minute: π_90(s) = shoot on goal

General rationale: a limited optimization time frame (finite horizon, see below) usually requires a non-stationary policy!
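A non-stationary policy is just a sequence of selection functions. A sketch using the soccer illustration from the slide (states, actions, and the 90-minute indexing are made up for the example):

```python
# Two selection functions pi_t: S -> U for the same state set:

def pi_early(s):
    """Reasonable early in the game."""
    return "return_pass" if s == "midfield_has_ball" else "hold"

def pi_late(s):
    """Reasonable in the last minute."""
    return "shoot_on_goal" if s == "midfield_has_ball" else "hold"

# Non-stationary policy pi_hat = (pi_0, ..., pi_89): early minutes use
# pi_early, the final minute uses pi_late.
policy = [pi_early] * 89 + [pi_late]

# The same state leads to different actions at different times:
assert policy[0]("midfield_has_ball") == "return_pass"
assert policy[89]("midfield_has_ball") == "shoot_on_goal"
```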
Stationary policies

We will mostly look at stationary policies. Then it holds that π_0 = π_1 = ... = π_t = ... =: π and π̂ = (π, π, ..., π, ...)

With stationary policies, the terms policy and selection function become interchangeable. We will call the selection function π - as generally done in the literature - our policy.

Bertsekas uses the symbol µ for the selection function, so there are minor differences from the notation used there.

Remark: in the following, only deterministic selection functions will be used.
Goal of the policy

- reach the optimization goal over multiple stages (sequence of decisions)
- solving a dynamic optimization problem
Cumulated costs (costs-to-go)

Interesting: the cumulated costs for a given state s with a given policy π:

J^π(s) = Σ_{t ∈ T} c(s_t, π(s_t)),   s_0 = s

Wanted: an optimal policy π* so that for all s it holds that:

J^{π*}(s) = min_{π̂} Σ_{t ∈ T} c(s_t, π(s_t))

under the constraints s_{t+1} = f(s_t, u_t) and s_0 = s.
Cumulated costs in MDPs

Expected cumulated costs for a given state s using a given policy π:

J^π(s) = E_w [ Σ_{t ∈ T} c(s_t, π(s_t)) ],   s_0 = s

Wanted: an optimal policy π* so that for all s it holds that:

J^{π*}(s) = min_{π ∈ Π} E_w [ Σ_{t ∈ T} c(s_t, π(s_t)) ],   s_0 = s

under the constraint s_{t+1} = f(s_t, u_t, w_t), or with a given probability distribution P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u).
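The expectation above can be approximated by averaging simulated rollouts. A sketch on a toy 2-state MDP (the transition probabilities, costs, policy, and horizon are invented for illustration):

```python
import random

p = {  # p[i][u][j] = p_ij(u)
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}},
}
c = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.0, (1, 1): 0.5}
pi = {0: 1, 1: 0}          # stationary policy: state -> action

def rollout(s, horizon):
    """Accumulate c(s_t, pi(s_t)) along one sampled trajectory from s."""
    total = 0.0
    for _ in range(horizon):
        u = pi[s]
        total += c[(s, u)]
        succ = list(p[s][u].keys())
        weights = list(p[s][u].values())
        s = random.choices(succ, weights=weights)[0]   # sample next state
    return total

# Monte-Carlo estimate of J^pi(0) for a 5-stage problem:
random.seed(1)
estimate = sum(rollout(0, 5) for _ in range(2000)) / 2000
```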
Problem types

Definition (horizon): the horizon N of a problem denotes the number of decision stages to be traversed.

- finite horizon: problems with a given termination time
- infinite horizon: approximation for very long processes or processes with an unknown end (e.g. a control system)
Finite horizon

N-stage decision problem. Each state has terminal costs g(i) that are due if the system ends in state i after N stages.

Costs of a policy π:

J^π_N(s) = E[ g(s_N) + Σ_{t=0}^{N-1} c(s_t, π_t(s_t)) | s_0 = s ]

Generally: non-stationary policy
Infinite horizon

Costs of a policy π:

J^π(s) = lim_{N→∞} E[ Σ_{t=0}^{N} c(s_t, π_t(s_t)) | s_0 = s ]

Problem: finite costs?

Solution: discount α < 1:

J^π(s) = lim_{N→∞} E[ Σ_{t=0}^{N} α^t c(s_t, π_t(s_t)) | s_0 = s ]
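Why the discount keeps the costs finite: if the per-stage costs are bounded by some C, the discounted sum is bounded by the geometric series C / (1 - α). A quick numeric check with illustrative values:

```python
# Truncated discounted cost sum vs. the geometric-series bound C / (1 - alpha).
alpha, C = 0.9, 2.0

partial = sum(alpha**t * C for t in range(200))   # first 200 stages
bound = C / (1 - alpha)                           # geometric series limit = 20.0

# The truncated sum stays below the bound and approaches it:
assert partial <= bound
assert bound - partial < 1e-6
```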
Solution of dynamic optimization problems

Central question: how do we find the policy that leads (on average) to minimal costs?

Remark: we can formulate this analogously as a maximization problem (e.g. maximizing the gain).

Solution methods: Dynamic Programming (Bellman, 1957)
- Backward Dynamic Programming
- Value Iteration (LU 3 ff.)
- Policy Iteration (LU 3 ff.)
Backward Dynamic Programming - idea

Problem: stochastic multi-stage decision problems with finite horizon

Idea: calculate the costs starting from the last stage back to the first stage.

Example: find the shortest path in a graph
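The backward idea on a tiny shortest-path instance: compute costs-to-go from the goal backwards through the graph. The graph and edge costs below are made up for illustration:

```python
# edges[node] = list of (successor, edge_cost); "G" is the goal node.
edges = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("G", 5.0)],
    "C": [("G", 1.0)],
    "G": [],
}

# Process nodes in reverse topological order, starting from the goal:
J = {"G": 0.0}                      # costs-to-go at the goal are zero
for node in ["B", "C", "A"]:
    J[node] = min(cost + J[succ] for succ, cost in edges[node])

# Shortest A -> G path goes via C: 4.0 + 1.0 = 5.0
```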
Backward Dynamic Programming - problem specification (1)

- finite-horizon MDP: N discrete decision points, t ∈ T = {0, 1, ..., N}
- state set finite: s_t ∈ S = {1, 2, ..., n}
- action set finite: u_t ∈ U = {u_1, ..., u_m}
- transition probabilities p_ij(u): P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u)
- direct costs c : S × U → R
- in the last stage N, every state causes terminal costs g(s_N) := c_N(s_N)
Backward Dynamic Programming - objective

Wanted: π* with J^{π*} = min_π J^π, where J^π_N(i) = E[ g(s_N) + Σ_{t=0}^{N-1} c(s_t, π_t(s_t)) | s_0 = i ]

The costs belonging to π* are called the optimal cumulated costs J* := J^{π*}.

Approach:
1. Calculation of the optimal cumulated costs ("costs-to-go") J*_k(·) for all states (J*_k(·) is an n-dimensional vector); k is the number of remaining steps.
2. From J*_k follows the optimal policy for the k-step problem (k steps until the process terminates).
Backward Dynamic Programming - motivation

Thesis - Bellman's Principle of Optimality: if I have k more steps to go, the optimal costs for a state i are given by the minimal expected value of the direct transition costs plus the optimal cumulated costs of the next state, given that k-1 more steps are to be done from there. The minimization goes over all possible actions.
Bellman's Principle of Optimality

Formally: for the optimal cumulated costs J*_k(i) of the k-stage decision problem, it holds that:

J*_k(i) = min_{u ∈ U(i)} E_{w_k} { c(i, u) + J*_{k-1}(f(i, u, w_k)) }
        = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) (c(i, u) + J*_{k-1}(j)),   i = 1, ..., n   (1)

Hence we can calculate the optimal cumulated costs of the N-stage optimization problem recursively, starting with k = 0.

→ Backward-DP algorithm
Bellman's Principle of Optimality - proof (1)

A policy π̂^(k) for k stages: π̂^(k) = (π_k, π_{k-1}, π_{k-2}, ...) = (π_k, π̂^(k-1))

Let S^(k)(i) = (s_{N-k} = i, s_{(N-k)+1}, ..., s_N) be a possible state sequence starting in state i with k transitions.

J*_k(i) = min_{π̂^(k)} J^{π̂^(k)}_k(i)   (2)

        = min_{π̂^(k)} Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) )   (3)

        = min_{π̂^(k)} { c(i, π_k(i)) + Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (4)

(The first-stage cost c(i, π_k(i)) is deterministic given i and π_k, so it can be pulled out of the sum.)

Bellman's Principle of Optimality - proof (2)

Splitting off the first transition using the Markov property:

        = min_{π̂^(k)} { c(i, π_k(i)) + Σ_{j ∈ S} P(s_{(N-k)+1} = j | s_{N-k} = i, π_k)
              Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (5, 6)

Bellman's Principle of Optimality - proof (3)

The first action u = π_k(i) can be optimized independently of π̂^(k-1), so the inner minimization moves inside the sum:

        = min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{(N-k)+1} = j | s_{N-k} = i, u)
              min_{π̂^(k-1)} Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (9, 10)

Bellman's Principle of Optimality - proof (4)

The inner minimization is exactly the (k-1)-stage problem starting in j:

        = min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{(N-k)+1} = j | s_{N-k} = i, u) min_{π̂^(k-1)} J^{π̂^(k-1)}_{k-1}(j) }   (12)

        = min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{(N-k)+1} = j | s_{N-k} = i, u) J*_{k-1}(j) }   (13, 14)

        = min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} p_ij(u) J*_{k-1}(j) }   (15)
Backward Dynamic Programming - algorithm

k = 0: J*_0(i) = g(i) for all i ∈ S

For k = 1 To N, i ∈ S:

J*_k(i) = min_{u ∈ U(i)} E_{w_k} { c(i, u) + J*_{k-1}(f(i, u, w_k)) }

or

J*_k(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) (c(i, u) + J*_{k-1}(j))
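The algorithm above can be sketched directly. States, actions, transition probabilities, costs, and terminal costs below are invented; the loop follows the slide's recursion J*_0(i) = g(i), then J*_k(i) = min_u Σ_j p_ij(u) (c(i,u) + J*_{k-1}(j)):

```python
# Toy MDP: 2 states, 2 actions, horizon N = 3.
n_states, actions, N = 2, [0, 1], 3
p = {  # p[(i, u)] = [p_i0(u), p_i1(u)]
    (0, 0): [1.0, 0.0], (0, 1): [0.3, 0.7],
    (1, 0): [0.6, 0.4], (1, 1): [0.0, 1.0],
}
c = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 3.0}
g = [0.0, 10.0]        # terminal costs g(i) = c_N(i)

J = list(g)            # k = 0: J*_0 = g
for k in range(1, N + 1):
    # J*_k(i) = min over u of the expected direct cost plus costs-to-go:
    J = [min(sum(p[(i, u)][j] * (c[(i, u)] + J[j]) for j in range(n_states))
             for u in actions)
         for i in range(n_states)]
# J now holds J*_N for both states.
```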
Choosing an action

Requirement: J*_k(i) is known for all k ≤ N.

Approach: we simply calculate, for all possible actions, the expected costs and choose the best action (with minimal expected cumulated costs):

π*_k(i) ∈ argmin_{u ∈ U(i)} E_{w_k} { c(i, u) + J*_{k-1}(f(i, u, w_k)) }

The chosen optimal action minimizes the sum of the expected transition costs plus the expected cumulated costs of the remaining problem.

Remarks:
- J*_k defines an optimal policy
- the policy is not unique, but J*_k is
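Extracting a greedy selection function from known costs-to-go can be sketched per state: take the action minimizing the expected one-step cost plus J*_{k-1}. The toy MDP values below are invented:

```python
# p[(i, u)] = [p_i0(u), p_i1(u)]; costs c(i, u); J_prev = J*_{k-1}.
p = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
     (1, 0): [1.0, 0.0], (1, 1): [0.0, 1.0]}
c = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 4.0, (1, 1): 0.5}
J_prev = [0.0, 3.0]    # e.g. obtained from backward DP

def greedy(i):
    """argmin_u sum_j p_ij(u) * (c(i,u) + J*_{k-1}(j))"""
    return min([0, 1], key=lambda u: sum(
        p[(i, u)][j] * (c[(i, u)] + J_prev[j]) for j in range(2)))

# State 0: u=0 costs 1.0 + 0.0 = 1.0, u=1 costs 0.2 + 3.0 = 3.2 -> choose 0.
# State 1: u=0 costs 4.0 + 0.0 = 4.0, u=1 costs 0.5 + 3.0 = 3.5 -> choose 1.
```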
Remarks

- complexity for deterministic systems: O(N · n · m)
- complexity for stochastic systems: O(N · n² · m)
- an exact solution is rarely computable → numeric solution; but: very complex!

(N = number of stages, n = number of states, m = number of actions)