CS 188 Spring 2012
Introduction to Artificial Intelligence
Practice Midterm II Solutions

Q1. [19 pts] Cheating at cs188-blackjack

Cheating dealers have become a serious problem at the cs188-blackjack tables. A cs188-blackjack deck has 3 card types (5, 10, 11), and an honest dealer is equally likely to deal each of the 3 cards. When a player holds 11, cheating dealers deal a 5 with probability 1/4, a 10 with probability 1/2, and an 11 with probability 1/4. You estimate that 4/5 of the dealers in your casino are honest (H) while 1/5 are cheating (¬H).

Note: You may write answers in the form of arithmetic expressions involving numbers or references to probabilities that have been directly specified, are specified in your conditional probability tables below, or are specified as answers to previous questions.

(a) [3 pts] You see a dealer deal an 11 to a player holding 11. What is the probability that the dealer is cheating?

    P(¬H | D = 11) = P(¬H, D = 11) / P(D = 11)
                   = P(D = 11 | ¬H) P(¬H) / [P(D = 11 | ¬H) P(¬H) + P(D = 11 | H) P(H)]
                   = (1/4)(2/10) / [(1/4)(2/10) + (1/3)(8/10)]
                   = 3/19
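
As a quick numeric check of part (a), here is a minimal Python sketch; the dictionary names are invented for illustration, and the numbers simply transcribe the probabilities given in the problem:

    # CPTs transcribed from the problem statement.
    P_H = {True: 4/5, False: 1/5}                      # P(honest), P(cheating)
    P_D_given_H = {True:  {5: 1/3, 10: 1/3, 11: 1/3},  # honest dealer
                   False: {5: 1/4, 10: 1/2, 11: 1/4}}  # cheating dealer

    # Bayes rule: P(cheating | D=11) = P(D=11 | not H) P(not H) / P(D=11).
    joint = {h: P_D_given_H[h][11] * P_H[h] for h in (True, False)}
    print(joint[False] / (joint[True] + joint[False]))  # 0.1578... = 3/19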

The casino has decided to install a camera to observe its dealers. Cheating dealers are observed doing suspicious things on camera (C) 4/5 of the time, while honest dealers are observed doing suspicious things 1/4 of the time.

(b) [4 pts] Draw a Bayes net with the variables H (honest dealer), D (card dealt to a player holding 11), and C (suspicious behavior on camera). Write the conditional probability tables.

H is the parent of both D and C (D ← H → C), with the tables:

    H   P(H)
    1   0.8
    0   0.2

    H   D    P(D | H)
    1   5    1/3
    1   10   1/3
    1   11   1/3
    0   5    1/4
    0   10   1/2
    0   11   1/4

    H   C   P(C | H)
    1   1   1/4
    1   0   3/4
    0   1   4/5
    0   0   1/5

(c) [2 pts] List all conditional independence assertions made by your Bayes net.

    D ⊥ C | H

Common mistakes: stating that two variables are NOT independent. Bayes nets do not guarantee that variables are dependent; dependence can only be verified by examining the exact probability distributions.

(d) [4 pts] What is the probability that a dealer is honest given that he deals a 10 to a player holding 11 and is observed doing something suspicious?

    P(H | D = 10, C) = P(H, D = 10, C) / P(D = 10, C)
                     = P(H) P(D = 10 | H) P(C | H) / [P(H) P(D = 10 | H) P(C | H) + P(¬H) P(D = 10 | ¬H) P(C | ¬H)]
                     = (4/5)(1/3)(1/4) / [(4/5)(1/3)(1/4) + (1/5)(1/2)(4/5)]
                     = 5/11

Common mistakes:
-1 for not giving the proper form of Bayes rule, P(H | D = 10, C) = P(H, D = 10, C) / P(D = 10, C).
-1 for treating C and D as independent: in a correctly drawn Bayes net C and D are not independent, which means that P(D = 10, C) ≠ P(D = 10) P(C).

You can either arrest dealers or let them continue working. If you arrest a dealer and he turns out to be cheating, you will earn a $4 bonus. However, if you arrest the dealer and he turns out to be innocent, he will sue you, for a payoff of −$10. Allowing a cheater to continue working yields a payoff of −$2, while allowing an honest dealer to continue working yields a payoff of $1. Assume a linear utility function U(x) = x.

(e) [3 pts] You observe a dealer doing something suspicious (C) and also observe that he deals a 10 to a player holding 11. Should you arrest the dealer?

Arresting the dealer yields an expected payoff of

    4 · P(¬H | D = 10, C) + (−10) · P(H | D = 10, C) = 4(6/11) + (−10)(5/11) = −26/11

Letting him continue working yields an expected payoff of

    (−2) · P(¬H | D = 10, C) + 1 · P(H | D = 10, C) = (−2)(6/11) + (1)(5/11) = −7/11

Therefore, you should let the dealer continue working.
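
Parts (d) and (e) can be checked the same way; again a sketch with made-up variable names, where True stands for an honest dealer:

    P_H = {True: 4/5, False: 1/5}
    P_D10_given_H = {True: 1/3, False: 1/2}   # P(D=10 | H), P(D=10 | not H)
    P_C_given_H = {True: 1/4, False: 4/5}     # P(C | H),    P(C | not H)

    # Part (d): chain rule on the net D <- H -> C.
    joint = {h: P_H[h] * P_D10_given_H[h] * P_C_given_H[h] for h in (True, False)}
    p_honest = joint[True] / (joint[True] + joint[False])
    print(p_honest)                                 # 0.4545... = 5/11

    # Part (e): expected payoff of each action.
    arrest = 4 * (1 - p_honest) - 10 * p_honest     # -26/11, about -2.36
    let_work = -2 * (1 - p_honest) + 1 * p_honest   # -7/11, about -0.64
    print(arrest, let_work)                         # letting him work is better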

(f) [3 pts] A private investigator approaches you and offers to investigate the dealer from the previous part. If you hire him, he will tell you with 100% certainty whether the dealer is cheating or honest, and you can then make a decision about whether to arrest him or not. How much would you be willing to pay for this information?

If you hire the private investigator and the dealer turns out to be a cheater, you can arrest him for a payoff of $4; if he turns out to be honest, you can let him continue working for a payoff of $1. The expected payoff when hiring the investigator is therefore

    4 · P(¬H | D = 10, C) + 1 · P(H | D = 10, C) = 4(6/11) + (1)(5/11) = 29/11

If you do not hire him, your best course of action is to let the dealer continue working, for an expected payoff of −7/11. Therefore, you are willing to pay up to 29/11 − (−7/11) = 36/11 to hire the investigator.
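
A sketch of the value-of-information computation in (f), reusing the posterior from part (d):

    p_honest = 5/11  # P(H | D=10, C) from part (d)

    # With perfect information: arrest cheaters (+4), keep honest dealers (+1).
    with_info = 4 * (1 - p_honest) + 1 * p_honest       # 29/11
    # Without it, the best action from part (e) is to let him keep working.
    without_info = -2 * (1 - p_honest) + 1 * p_honest   # -7/11

    print(with_info - without_info)  # 36/11, about 3.27: the most you should pay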

Q2. [20 pts] MDPs: Grid-World Water Park

Consider the MDP drawn below. The state space consists of all squares in a grid-world water park. There is a single waterslide that is composed of two ladder squares and two slide squares (marked with vertical bars and squiggly lines respectively). An agent in this water park can move from any square to any neighboring square, unless the current square is a slide, in which case it must move forward one square along the slide. The actions are denoted by arrows between squares on the map and all deterministically move the agent in the given direction. The agent cannot stand still: it must move on each time step. Rewards are also shown below: the agent feels great pleasure as it slides down the water slide (+2), a certain amount of discomfort as it climbs the rungs of the ladder (−1), and receives rewards of 0 otherwise. The time horizon is infinite; this MDP goes on forever.

[Figure: map of the grid-world water park, showing the ladder squares, the slide squares, and the arrows denoting the available actions.]

(a) [2 pts] How many (deterministic) policies π are possible for this MDP?

    2^11 (there are 11 squares with two available actions each; moves on the slide squares are forced)

(b) [9 pts] Fill in the blank cells of this table with values that are correct for the corresponding function, discount, and state. Hint: You should not need to do substantial calculation here.

                       γ     s = A   s = E
    V*_3(s)            1.0   0       4
    V*_10(s)           1.0   2       4
    V*_10(s)           0.1   0       2.2
    Q*_1(s, left)      1.0           0
    Q*_10(s, left)     1.0           3
    V*(s)              1.0   ∞       ∞
    V*(s)              0.1   0       2.2
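
The s = E column can be sanity-checked without the full map, assuming (as the table's values suggest) that from E the optimal agent rides the two slide squares for +2 each and collects nothing else within the horizon. A tiny sketch under that assumption, with a helper name of my own choosing:

    def slide_value(gamma):
        # Reward sequence from E: +2 for each of the two slide squares.
        return 2 + gamma * 2

    print(slide_value(1.0))  # 4.0, matching V*_3(E) and V*_10(E) at gamma = 1.0
    print(slide_value(0.1))  # 2.2, matching V*_10(E) and V*(E) at gamma = 0.1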

Use this labeling of the state space to complete the remaining subproblems:

[Figure: the same map with labeled squares; the transitions below involve the squares D, E, F, and G.]

(c) [5 pts] Fill in the blank cells of this table with the Q-values that result from applying the Q-update for the transition specified on each row. You may leave Q-values that are unaffected by the current update blank. Use discount γ = 1.0 and learning rate α = 0.5. Assume all Q-values are initialized to 0. (Note: the specified transitions would not arise from a single episode.)

                                                      Q(D, left)   Q(D, right)   Q(E, left)   Q(E, right)
    Initial:                                          0            0             0            0
    Transition 1: (s = D, a = right, r = −1, s′ = E)                −0.5
    Transition 2: (s = E, a = right, r = +2, s′ = F)                                           1.0
    Transition 3: (s = E, a = left,  r = 0,  s′ = D)
    Transition 4: (s = D, a = right, r = −1, s′ = E)                −0.25
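
The whole table can be reproduced by replaying the four transitions through the tabular update Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_a′ Q(s′, a′)). A minimal sketch (square F only ever appears as a next state, so its default Q-value of 0 is all the update needs):

    from collections import defaultdict

    gamma, alpha = 1.0, 0.5
    ACTIONS = ("left", "right")   # the only actions relevant to these squares
    Q = defaultdict(float)        # all Q-values initialized to 0

    def q_update(s, a, r, s2):
        sample = r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

    q_update("D", "right", -1, "E"); print(Q[("D", "right")])  # -0.5
    q_update("E", "right", +2, "F"); print(Q[("E", "right")])  # 1.0
    q_update("E", "left",   0, "D"); print(Q[("E", "left")])   # 0.0 (unchanged)
    q_update("D", "right", -1, "E"); print(Q[("D", "right")])  # -0.25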

The agent is still at the water park MDP, but now we're going to use function approximation to represent Q-values. Recall that a policy π is greedy with respect to a set of Q-values as long as ∀a, s: Q(s, π(s)) ≥ Q(s, a) (so ties may be broken in any way). For the next subproblem, consider the following feature functions:

    f(s, a)  = 1 if a = right, 0 otherwise.

    f′(s, a) = 1 if (a = right) and isslide(s), 0 otherwise.

(Note: isslide(s) is true iff the state s is a slide square, i.e. either F or G.)

Also consider the following policies:

[Figure: three policies π1, π2, π3, each drawn as a map of arrows over the state space.]

(d) [4 pts] Which are greedy policies with respect to the Q-value approximation function obtained by running the single Q-update for the transition (s = F, a = right, r = +2, s′ = G) while using the specified feature function? You may assume that all feature weights are zero before the update. Use discount γ = 1.0 and learning rate α = 1.0. Circle all that apply.

    f:  π1  π2  π3
    f′: π1  π2  π3
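
Although the circled policies did not survive transcription, the weight produced by the single approximate Q-update can still be worked out: with all weights zero the TD error is r + γ max_a′ Q(G, a′) − Q(F, right) = 2 + 0 − 0 = 2, so the updated weight is w = α · 2 · f(s, a) for whichever feature fires on (F, right). A sketch, with state and action names following the problem's labeling:

    alpha, gamma = 1.0, 1.0
    SLIDE = {"F", "G"}
    ACTIONS = ("left", "right")

    def f(s, a):        # 1 whenever the action is right
        return 1.0 if a == "right" else 0.0

    def f_prime(s, a):  # 1 only for right moves on slide squares
        return 1.0 if a == "right" and s in SLIDE else 0.0

    def weight_after_update(feat):
        w = 0.0                                # weights start at zero
        q = lambda s, a: w * feat(s, a)        # linear Q-function
        s, a, r, s2 = "F", "right", 2.0, "G"
        diff = (r + gamma * max(q(s2, a2) for a2 in ACTIONS)) - q(s, a)
        return w + alpha * diff * feat(s, a)

    print(weight_after_update(f))        # 2.0: Q is 2 for every right move
    print(weight_after_update(f_prime))  # 2.0: Q is 2 only for right moves on the slide

Under f the resulting greedy policy must move right wherever possible; under f′ all non-slide actions tie at zero, so any policy that moves right on the slide squares is greedy.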

Q3. [17 pts] Bayes Nets

Suppose that a patient can have a symptom (S) that can be caused by two different diseases (A and B). Disease A is much rarer, but there is a test T that tests for the presence of A. The Bayes net for this situation has arcs A → T, A → S, and B → S; the corresponding conditional probability tables are shown below. For each part, you may leave your answer as a fraction.

    P(A)
    +a   0.1
    −a   0.9

    P(B)
    +b   0.5
    −b   0.5

    P(T | A)
    +a   +t   1
    +a   −t   0
    −a   +t   0.2
    −a   −t   0.8

    P(S | A, B)
    +a   +b   +s   1
    +a   +b   −s   0
    +a   −b   +s   0.80
    +a   −b   −s   0.20
    −a   +b   +s   1
    −a   +b   −s   0
    −a   −b   +s   0
    −a   −b   −s   1

(a) [1 pt] Compute the following entry from the joint distribution:

    P(−a, −t, +b, +s) = P(−t | −a) P(−a) P(+s | +b, −a) P(+b) = (0.8)(0.9)(1)(0.5) = 0.36

(b) [2 pts] What is the probability that a patient has disease A given that they have disease B?

    P(+a | +b) = P(+a) = 0.1

since A and B are independent in this net (with S unobserved, the path A → S ← B is blocked).

(c) [2 pts] What is the probability that a patient has disease A given that they have symptom S, disease B, and test T returns positive?

    P(+a | +t, +s, +b) = P(+a, +t, +s, +b) / P(+t, +s, +b)
                       = P(+a) P(+t | +a) P(+s | +a, +b) P(+b) / Σ_{a ∈ {+a, −a}} P(a, +t, +s, +b)
                       = (0.1)(1)(1)(0.5) / [(0.1)(1)(1)(0.5) + (0.9)(0.2)(1)(0.5)]
                       = 0.05 / (0.05 + 0.09) = 5/14 ≈ 0.357

(d) [3 pts] What is the probability that a patient has disease A given that they have symptom S and test T returns positive?

    P(+a | +t, +s) = P(+a, +t, +s) / P(+t, +s)
                   = Σ_{b ∈ {+b, −b}} P(+a) P(+t | +a) P(+s | +a, b) P(b) / Σ_{b ∈ {+b, −b}} Σ_{a ∈ {+a, −a}} P(a) P(+t | a) P(+s | a, b) P(b)
                   = 0.5
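
All four answers can be verified by enumerating the joint P(A) P(B) P(T | A) P(S | A, B). A minimal inference-by-enumeration sketch, where True encodes the + value of each variable:

    import itertools

    P_A = {True: 0.1, False: 0.9}
    P_B = {True: 0.5, False: 0.5}
    P_T_given_A = {True: 1.0, False: 0.2}                     # P(+t | a)
    P_S_given_AB = {(True, True): 1.0, (True, False): 0.8,
                    (False, True): 1.0, (False, False): 0.0}  # P(+s | a, b)

    def joint(a, t, s, b):
        p_t = P_T_given_A[a] if t else 1 - P_T_given_A[a]
        p_s = P_S_given_AB[(a, b)] if s else 1 - P_S_given_AB[(a, b)]
        return P_A[a] * P_B[b] * p_t * p_s

    # (a) P(-a, -t, +b, +s)
    print(joint(False, False, True, True))                   # 0.36
    # (c) P(+a | +t, +s, +b)
    num = joint(True, True, True, True)
    print(num / (num + joint(False, True, True, True)))      # 5/14, about 0.357
    # (d) P(+a | +t, +s), summing out B
    num = sum(joint(True, True, True, b) for b in (True, False))
    den = sum(joint(a, True, True, b)
              for a, b in itertools.product((True, False), repeat=2))
    print(num / den)                                         # 0.5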

(e) [3 pts] Suppose that both diseases A and B become more likely as a person ages. Add any necessary variables and/or arcs to the Bayes net to represent this change. For any variables you add, briefly (one sentence or less) state what they represent. Also, state one independence or conditional independence assertion that is removed due to your changes.

New variable(s) (and brief definition): G, the patient's age, added as a common parent of both diseases: G → A and G → B. The rest of the net is unchanged (A → T, A → S, B → S).

Removed conditional independence assertion: There are a few. The simplest is that A ⊥ B is no longer guaranteed. Another is B ⊥ T. (A numeric illustration appears after part (f) below.)

(f) [6 pts] Based only on the structure of the (new) Bayes net given below, circle whether the following conditional independence assertions are guaranteed to be true, guaranteed to be false, or cannot be determined by the structure alone.

[Figure: a Bayes net over the variables U, V, W, X, Y, Z.]

1. V ⊥ W          Guaranteed true    Cannot be determined    Guaranteed false
2. V ⊥ W | U      Guaranteed true    Cannot be determined    Guaranteed false
3. V ⊥ W | U, Y   Guaranteed true    Cannot be determined    Guaranteed false
4. V ⊥ Z | U, X   Guaranteed true    Cannot be determined    Guaranteed false
5. X ⊥ Z | W      Guaranteed true    Cannot be determined    Guaranteed false
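
Returning to part (e): to see concretely why a common parent removes A ⊥ B, here is a toy calculation. The age prior and the per-age disease rates below are hypothetical numbers, invented only to make the dependence visible:

    # Hypothetical CPTs for the new variable G (age) with arcs G -> A, G -> B.
    P_G = {"old": 0.5, "young": 0.5}
    P_A_given_G = {"old": 0.2, "young": 0.05}   # disease A more likely when old
    P_B_given_G = {"old": 0.8, "young": 0.3}    # disease B more likely when old

    P_A = sum(P_G[g] * P_A_given_G[g] for g in P_G)                    # 0.125
    P_B = sum(P_G[g] * P_B_given_G[g] for g in P_G)                    # 0.55
    P_AB = sum(P_G[g] * P_A_given_G[g] * P_B_given_G[g] for g in P_G)  # 0.0875

    # P(A | B) is about 0.159, not 0.125: A and B are now dependent.
    print(P_A, P_AB / P_B)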

Q4. [9 pts] Sampling

Assume the following Bayes net, with arcs A → C, B → C, and C → D, and the corresponding distributions over the variables in the Bayes net:

    P(A)
    +a   1/5
    −a   4/5

    P(B)
    +b   1/3
    −b   2/3

    P(C | A, B)
    +a   +b   +c   0
    +a   +b   −c   1
    +a   −b   +c   0
    +a   −b   −c   1
    −a   +b   +c   2/5
    −a   +b   −c   3/5
    −a   −b   +c   1/3
    −a   −b   −c   2/3

    P(D | C)
    +c   +d   1/2
    +c   −d   1/2
    −c   +d   1/4
    −c   −d   3/4

(a) [1 pt] Your task is now to estimate P(+c | −a, −b, −d) using rejection sampling. Below are some samples that have been produced by prior sampling (that is, the rejection stage in rejection sampling hasn't happened yet). Cross out the samples that would be rejected by rejection sampling:

    −a −b +c +d   (rejected: +d contradicts the evidence −d)
    +a −b −c +d   (rejected)
    −a −b +c −d   (kept)
    +a −b −c −d   (rejected)
    −a +b +c +d   (rejected)
    −a −b −c −d   (kept)

(b) [2 pts] Using those samples, what value would you estimate for P(+c | −a, −b, −d) using rejection sampling?

    1/2 (one of the two surviving samples has +c)

(c) [3 pts] Using the following samples (which were generated using likelihood weighting), estimate P(+c | −a, −b, −d) using likelihood weighting, or state why it cannot be computed.

    −a −b −c −d
    −a −b +c −d
    −a −b +c −d

We compute the weight of each sample, which is the product of the probabilities of the evidence variables conditioned on their parents:

    w1 = P(−a) P(−b) P(−d | −c) = 4/5 · 2/3 · 3/4
    w2 = w3 = P(−a) P(−b) P(−d | +c) = 4/5 · 2/3 · 1/2

so, normalizing, we have (w2 + w3) / (w1 + w2 + w3) = 4/7.

(d) [3 pts] Below are three sequences of samples. Circle any sequence that could have been generated by Gibbs sampling.

    Sequence 1          Sequence 2          Sequence 3
    1: −a −b −c +d      1: −a −b −c +d      1: −a −b −c +d
    2: −a −b −c +d      2: −a −b −c −d      2: −a −b −c −d
    3: −a −b +c +d      3: −a −b +c +d      3: −a +b −c −d

Sequences 1 and 3 change at most one variable per row, and hence could have been generated by Gibbs sampling. In sequence 2, both C and D change between the second and third samples.
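
As a check on part (c), the likelihood-weighting estimate is easy to reproduce: each sample's weight is the product of the evidence likelihoods P(−a) P(−b) P(−d | c), where c is that sample's value of C. A minimal sketch:

    # Evidence: -a, -b, -d. Only C is sampled; its value in each LW sample is:
    c_samples = [False, True, True]

    P_a_false, P_b_false = 4/5, 2/3
    P_d_false_given_c = {True: 1/2, False: 3/4}   # P(-d | +c), P(-d | -c)

    weights = [P_a_false * P_b_false * P_d_false_given_c[c] for c in c_samples]
    estimate = sum(w for w, c in zip(weights, c_samples) if c) / sum(weights)
    print(estimate)  # 0.5714... = 4/7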

Q5. [13 pts] Multiple-choice and short-answer questions

In the following problems please choose all the answers that apply, if any. You may circle more than one answer. You may also circle no answers (none of the above).

(a) [2 pts] Value iteration:
(i) Is a model-free method for finding optimal policies.
(ii) Is sensitive to local optima.
(iii) Is tedious to do by hand.
(iv) Is guaranteed to converge when the discount factor satisfies 0 < γ < 1.

Answer: (iii) and (iv). Value iteration requires a model (a fully specified MDP), and it is not sensitive to getting stuck in local optima.

(b) [2 pts] Bayes nets:
(i) Have an associated directed, acyclic graph.
(ii) Encode conditional independence assertions among random variables.
(iii) Generally require less storage than the full joint distribution.
(iv) Make the assumption that all parents of a single child are independent given the child.

Answer: (i), (ii), and (iii); all three are true statements. (iv) is false: given the child, the parents are not independent.

(c) [2 pts] If we use an ε-greedy exploration policy with Q-learning, the estimates Q_t are guaranteed to converge to Q* only if:
(i) ε goes to zero as t goes to infinity, or
(ii) the learning rate α goes to zero as t goes to infinity, or
(iii) both α and ε go to zero.

Answer: (ii). The learning rate must approach 0 as t → ∞ in order for convergence to be guaranteed. Note that Q-learning learns off-policy (in other words, it learns about the optimal policy even if the policy being executed is suboptimal). This means that ε need not approach zero for convergence.

(d) [2 pts] True or false? Suppose X and Y are correlated random variables. Then

    P(X = x, Y = y) = P(X = x) P(Y = y | X = x)

True. This is the product rule.

(e) [3 pts] MDPs. For this question, assume that the MDP has a finite number of states.
(i) [false] For an MDP (S, A, T, γ, R), if we only change the reward function R, the optimal policy is guaranteed to remain the same.
(ii) [true] Value iteration is guaranteed to converge if the discount factor γ satisfies 0 < γ < 1.
(iii) [false] Policies found by value iteration are superior to policies found by policy iteration.

(f) [2 pts] Reinforcement learning:
(i) [true] Q-learning can learn the optimal Q-function Q* without ever executing the optimal policy.
(ii) [false] If an MDP has a transition model T that assigns non-zero probability to all triples T(s, a, s′), then Q-learning will fail.