
A simple analysis of the TV game WHO WANTS TO BE A MILLIONAIRE?®

Federico Perea, Justo Puerto

MaMaEuSch - Management Mathematics for European Schools
94342 - CP - 1 - 2001 - DE - COMENIUS - C21
University of Seville

This project has been carried out with the partial support of the European Union in the framework of the Sokrates programme. The content does not necessarily reflect the position of the European Union, nor does it involve any responsibility on the part of the European Union.

1 Introduction

This article is about the popular TV game show Who Wants To Be A Millionaire?®. When this paper was written, there were 45 versions of Who Wants To Be A Millionaire?®, presented in 71 countries, and in more than 100 countries the licence had been bought by TV stations, so the show will be broadcast there sooner or later. Who Wants To Be A Millionaire?® debuted in the United Kingdom in September 1998 and was very successful there. Afterwards, it spread all over the world, coming to Spain in the summer of 2000. Here, it was broadcast by the TV station Telecinco under the title ¿Quiere ser millonario?®. The rules of the game are similar in all countries; in this article we consider only the Spanish version.

One candidate is chosen out of a pool of ten and has the chance of winning the top prize of 300000 Euros. In order to achieve this, she must answer 15 multiple-choice questions correctly in a row. The contestant may quit at any time, keeping her earnings. At each step, she is shown the question and four possible answers before deciding whether to play on or not. Once she has decided to stay in the game, the next answer has to be correct for her to continue playing. Each question has a certain monetary value, given here in Euros. The money that a candidate can win if she answers the question correctly is given in table 1.

There are three stages ("guarantee points") where the money is banked and cannot be lost even if the candidate gives an incorrect answer to one of the following questions: 1800, 18000 and 300000 Euros. There is no time limit for answering a question; if the show's time runs out on a particular day, the next programme continues with that player's game.

At any point, the contestant may use one or more of her three lifelines. These are:

50:50 option: the computer eliminates two possible answers, leaving one wrong answer and the correct one.

Phone a friend: the contestant may discuss the question with a friend or relative on the phone for 30 seconds.

Ask the audience: the audience members choose the answer they consider correct by pressing the corresponding button on their keypads. The result of this poll is shown in percentages.

In the sequel we will refer to these lifelines as lifeline 1 for the 50:50 option, lifeline 2 for Phone a friend, and lifeline 3 for Ask the audience. Each lifeline may be used only once during a contestant's entire game.

question index   monetary value
      1                 150
      2                 300
      3                 450
      4                 900
      5                1800
      6                2100
      7                2700
      8                3600
      9                4500
     10                9000
     11               18000
     12               36000
     13               72000
     14              144000
     15              300000

Table 1: Immediate rewards

The primary aim of this paper is to show how a difficult real decision-making problem can be easily modelled and solved with basic Operations Research tools, in our case by discrete event dynamic programming. In this regard, the paper is quite simple from the mathematical point of view. Nevertheless, the modelling phase is not so straightforward and, moreover, the approach can be used as a motivating case study when presenting dynamic programming in the classroom. This aim is achieved in three phases: 1. modelling, 2. mathematical formulation, 3. simulation of the actual process.

In the modelling phase we identify the essential building blocks that describe the problem and link them to elements of mathematical models. In the formulation phase we describe the game as a discrete-time Markov decision process that is solved by discrete event dynamic programming. Two models are presented that guide the players to optimal strategies: one maximizing the expected reward, which will be called the maximum expected reward strategy, and one maximizing the probability of reaching a given question, which will be called the maximum probability strategy.

The rest of the paper is organized as follows. The second section presents the general mathematical model (states, feasible actions, rewards, transition function, probabilities of answering correctly and their estimation). The third section describes the first model, in which we maximize the expected reward, and then the second model, in which we maximize the probability of reaching and answering correctly a given question, starting from any departing state. After that, we present some concluding remarks based on a simulation of how to play in a dynamic way.

2 The general model

The actual game requires the contestant to make a decision each time a question is answered correctly. The planning horizon is finite: we have N = 16 stages, where the 16th stage stands for the situation after answering question number 15 correctly. To make a decision, the candidate has to know the index of the question she faces and the lifelines she has already used; the history of the game is summarized in this information. We define S as the set of state vectors s = (k, l_1, l_2, l_3), where k is the index of the current question and

    l_i = 1 if lifeline i may still be used,
    l_i = 0 if lifeline i was already used on an earlier question.

For any state s in S, let A(s) denote the set of feasible actions in that state. If we are in state s = (k, l_1, l_2, l_3), then A(s) depends on the question index and on the lifelines left. If k = 16, the game is over and there are no feasible actions. If k <= 15, the candidate has several possibilities:

Answer the question without using lifelines.

Answer the question employing one or more lifelines, if any are left. In this case, the candidate must also specify which lifelines she is going to use.

Stop and quit the game.

If the player decides not to answer, the immediate reward is the monetary value of the last question answered. If the candidate decides to answer, the immediate reward is a random variable that depends on the probability of answering correctly: if the candidate fails, the immediate reward is the last guarantee point reached before failing; if she chooses the correct alternative, there is no immediate reward, the candidate goes on to the next question, and the reward is the expected (final) reward.

Denote by r_k the immediate reward if the candidate decides to quit the game after answering question k correctly, i.e., if she stops in a state s = (k + 1, l_1, l_2, l_3), and denote by r̄_k the immediate reward if she fails in a state s = (k + 1, l_1, l_2, l_3). See table 2.

 k      r_k       r̄_k
 0          0          0
 1        150          0
 2        300          0
 3        450          0
 4        900          0
 5       1800       1800
 6       2100       1800
 7       2700       1800
 8       3600       1800
 9       4500       1800
10       9000       9000
11      18000       9000
12      36000       9000
13      72000       9000
14     144000       9000
15     300000     300000

Table 2: Immediate versus ensured rewards

After a decision is made, the process evolves to a new state.
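For readers who want to experiment with the model, the following Python sketch (ours; the paper's original computations were done in MAPLE) encodes the states, the rewards of table 2 and the feasible action sets A(s). All names in it are our own illustrative choices.

from itertools import product
from typing import NamedTuple

class State(NamedTuple):
    k: int    # index of the current question; k = 16 means the game is won
    l1: int   # 1 if the 50:50 lifeline is still available, else 0
    l2: int   # 1 if "phone a friend" is still available, else 0
    l3: int   # 1 if "ask the audience" is still available, else 0

# r[k]: reward for quitting after answering question k correctly (table 2)
r = [0, 150, 300, 450, 900, 1800, 2100, 2700, 3600, 4500,
     9000, 18000, 36000, 72000, 144000, 300000]
# rbar[k]: ensured reward when failing in a state (k+1, .) (table 2)
rbar = [0, 0, 0, 0, 0, 1800, 1800, 1800, 1800, 1800,
        9000, 9000, 9000, 9000, 9000, 300000]

def feasible_actions(s: State):
    """A(s): either stop, or answer using any subset g = (g1, g2, g3)
    of the lifelines that are still available."""
    if s.k == 16:                          # game over: no feasible actions
        return []
    answer_options = product(range(s.l1 + 1), range(s.l2 + 1),
                             range(s.l3 + 1))
    return [("stop", (0, 0, 0))] + [("answer", g) for g in answer_options]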

If the candidate decides to stop, or if she fails a question, the game is over. If she decides to play and chooses the correct answer, there is a transition to another state t(s, a) = (k', l'_1, l'_2, l'_3) in S, where the question index k' is equal to k + 1 and the lifeline indicators l'_i are:

    l'_i = l_i - 1   if the candidate uses lifeline i on this question,
    l'_i = l_i       otherwise.

Answering correctly depends on certain probabilities for each question, which are the same for all candidates. We further assume that these probabilities can be influenced by using lifelines, which we suppose to be helpful (i.e. to increase the probability of answering correctly). Denote by p_s^a the probability of answering correctly if action a in A(s) is chosen in state s in S.

Our analysis takes into account the possible skill of the participants. For that, we divide the participants into four groups, namely A, B, C, D. Belonging to one of these groups means that the a priori probability p_s^a is modified according to a skill factor associated with each group. Mathematically, this is reflected in a multiplicative factor h_G, G in {A, B, C, D}, that modifies the probability as h_G p_s^a, where h_A = 1, h_B = 0.9, h_C = 0.8, h_D = 0.7. This means that the lower the skill of the participant, the smaller her probabilities of answering correctly.

One of the cornerstones in the resolution of the actual problem is to get a good estimation of the probabilities in the decision process. For a realistic estimation, one would need detailed data: for each question and for each possible combination of lifelines, there would have to be a certain number of candidates who answered correctly and who failed, and this number would have to be high enough to estimate the probabilities. The actual data are only available for approximately 40 games broadcast on Spanish TV and, of course, for most combinations of lifelines there are no observations, making it impossible to estimate those probabilities directly. Nevertheless, we had enough information to estimate the probabilities of answering correctly without using any lifeline and using one single lifeline. Therefore, in order to solve the problem, we make further assumptions. Let p_k denote the probability of answering a question correctly without using any lifeline.
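Continuing the sketch above, the transition rule and the skill factors h_G translate directly into code (again, the names are ours):

# Transition t(s, a) after a correct answer: the question index advances
# and every lifeline used in g = (g1, g2, g3) is spent (our sketch).
def transition(s: State, g: tuple) -> State:
    return State(s.k + 1, s.l1 - g[0], s.l2 - g[1], s.l3 - g[2])

# Skill groups scale every success probability by the factor h_G:
h = {"A": 1.0, "B": 0.9, "C": 0.8, "D": 0.7}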

We assume that there is a multiplicative relationship between the probability of failing in a given state using lifeline i and the probability of failing without lifelines: the probability of failing decreases by a fixed factor c_k^i, 0 < c_k^i < 1, i = 1, 2, 3, or in other words:

    p_k^i = 1 - (1 - p_k) c_k^i,                                        (1)

where p_k^i is the probability of answering question number k correctly using the i-th lifeline (both p_k and p_k^i are known for all k, i). We assume further that a combination of several lifelines modifies the original failure probability 1 - p_k in a multiplicative way, multiplying the corresponding c constants together. This simplification gives a heuristic expression for the probabilities; it can be justified because we did not have enough data to give a valid estimation for each combination of lifelines. Under this assumption, we can use the information that we have about the candidates to estimate the probability of answering correctly with any feasible combination of lifelines.

In the sequel we estimate, from the available data, the probabilities of answering correctly without using any lifeline and the constants c_k^i. For every question index k, we consider the candidates who did not use any lifeline and those who used exactly one lifeline. For each of these groups of candidates we then count the number who answered this question correctly and the number who failed. The probabilities are estimated by the relative frequencies observed in the data. Let p_k denote the probability of answering the k-th question correctly without using lifelines, p_k^1 the probability of answering correctly using lifeline 1 (50:50 option), p_k^2 the probability using lifeline 2 (phone a friend) and p_k^3 the probability using lifeline 3 (ask the audience). Table 3 shows the estimated probabilities (all probabilities in %; original values of 100% were replaced by 99%).

question index k   p_k   p_k^1   p_k^2   p_k^3
       1            97     99      99      99
       2            95     99      99      99
       3            92     99      99      99
       4            86     93      99      95
       5            80     91      98      93
       6            79     99      99      99
       7            76     87      90      88
       8            63     70      78      69
       9            51     67      70      65
      10            43     58      66      52
      11            39     57      68      50
      12            38     54      64      49
      13            40     54      60      47
      14            37     50      62      48
      15            36     52      60      45

Table 3: Estimated probabilities of correct answers

We use equation (1) to estimate the values of the c constants. Thus, for each question index k, the factor c_k^i that modifies the probability when lifeline i is used is given by:

    c_k^i = (1 - p_k^i) / (1 - p_k).

The resulting correction factors are shown in table 4.

 k    c_k^1    c_k^2    c_k^3
 1   0.3333   0.3333   0.3333
 2   0.2      0.2      0.2
 3   0.125    0.125    0.125
 4   0.5      0.0714   0.3571
 5   0.45     0.1      0.35
 6   0.0476   0.0476   0.0476
 7   0.5416   0.4166   0.5
 8   0.8108   0.5945   0.8378
 9   0.6734   0.6122   0.7142
10   0.7368   0.5964   0.8421
11   0.7049   0.5245   0.8196
12   0.7419   0.5806   0.8225
13   0.7666   0.6666   0.8833
14   0.7936   0.6031   0.8253
15   0.75     0.625    0.8593

Table 4: Correction factors
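As a small illustration, the correction factors can be recomputed from table 3 with a few lines of code. The dictionaries below transcribe only the first five rows of table 3 (in percent), and all names are again our own:

# Recomputing the correction factors of equation (1) from table 3 (sketch;
# only the first five question indices are transcribed here).
p = {1: 97, 2: 95, 3: 92, 4: 86, 5: 80}      # p_k without lifelines, in %
p_life = {                                    # p_k^i for lifeline i = 1, 2, 3
    1: {1: 99, 2: 99, 3: 99},
    2: {1: 99, 2: 99, 3: 99},
    3: {1: 99, 2: 99, 3: 99},
    4: {1: 93, 2: 99, 3: 95},
    5: {1: 91, 2: 98, 3: 93},
}

def c_factor(k: int, i: int) -> float:
    """c_k^i = (1 - p_k^i) / (1 - p_k), probabilities rescaled to [0, 1]."""
    return (1 - p_life[k][i] / 100) / (1 - p[k] / 100)

print(round(c_factor(4, 2), 4))   # 0.0714, the value in table 4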

3 Mathematical formulation

In this section we present two different models. The first is designed to maximize the expected reward, the second to maximize the probability of reaching a fixed goal. Besides the maximum expected reward and the maximum probability, the models also yield optimal strategies for achieving their respective goals.

3.1 Model 1: expected reward

Let p_s^a denote the probability of answering correctly if action a in A(s) is chosen in state s in S, and suppose that p_s^a depends only on the question index and on the lifelines used. Let f(s) be the maximum expected reward that can be obtained starting from state s. We can evaluate f(s) in the following way. The maximum expected reward from s is the maximum among the expected rewards of the different courses of action a in A(s). At that point, we can either quit the game, thus ensuring r_{k-1}, or go for the next question (say, with index k). In the latter case, if we choose an action a in A(s), we answer correctly with probability p_s^a and fail with probability 1 - p_s^a.

The reward when failing is the reward guaranteed before question k, i.e. r̄_{k-1}. Answering question k correctly, on the other hand, produces a transition to the next question with the remaining lifelines. Denote by t(s, a) the transition function that gives the new state when action a is chosen in state s; from that point on, the expected reward is f(t(s, a)). In summary, the expected reward under action a is

    p_s^a f(t(s, a)) + (1 - p_s^a) r̄_{k-1}.

Hence,

    f(s) = max_{a in A(s)} { r_{k-1},  p_s^a f(t(s, a)) + (1 - p_s^a) r̄_{k-1} }.

In order to get the maximum expected reward we have to evaluate f at the departing state. If the candidate starts at question number 1 with the three lifelines, we have to compute f(1, 1, 1, 1). The values of f can be computed recursively by backward induction once we know its value at every feasible state of the terminal stage. These values are easily computed and are shown in table 5.
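The recursion above can be implemented by memoised backward induction. The following sketch (ours, not the authors' MAPLE program) reuses State, feasible_actions, transition, r and rbar from the earlier sketches, and assumes dictionaries p and c holding the full tables 3 and 4, with combined lifelines multiplying their c factors as assumed in Section 2:

# Backward-induction sketch for Model 1 (our illustration).
from functools import lru_cache

def success_prob(k: int, g: tuple, p: dict, c: dict) -> float:
    """1 - (1 - p_k) * product of c_k^i over the lifelines used in g."""
    fail = 1 - p[k] / 100
    for i, used in enumerate(g, start=1):
        if used:
            fail *= c[k][i]               # each lifeline shrinks the failure
    return 1 - fail

def make_value_function(p: dict, c: dict):
    @lru_cache(maxsize=None)
    def f(s: State) -> float:
        if s.k == 16:                     # all 15 questions answered: top prize
            return r[15]
        best = r[s.k - 1]                 # quitting ensures r_{k-1}
        for kind, g in feasible_actions(s):
            if kind == "answer":
                q = success_prob(s.k, g, p, c)
                best = max(best,
                           q * f(transition(s, g)) + (1 - q) * rbar[s.k - 1])
        return best
    return f

# With the full tables, make_value_function(p, c)(State(1, 1, 1, 1)) should
# reproduce the value 2490.89 reported below (up to rounding of the inputs).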

State           f(state)
(15, 1, 1, 1)   224976.5
(15, 0, 0, 1)   144000
(15, 0, 1, 0)   183600
(15, 0, 1, 1)   199968.75
(15, 1, 1, 0)   212700
(15, 1, 0, 1)   179962.5
(15, 1, 0, 0)   160320
(15, 0, 0, 0)   144000

Table 5: Values of f at the states of the terminal stage

Therefore, using backward induction starting from the data in the table above, we obtain f(1, 1, 1, 1) and find optimal strategies. In this process we use the estimated probabilities and constants obtained in Section 2. All computations were performed with a MAPLE computer program. The solution obtained is f(1, 1, 1, 1) = 2490.89, and an optimal strategy is shown in table 6.

Question index    Strategy
      1           No lifelines
      2           No lifelines
      3           No lifelines
      4           No lifelines
      5           Audience
      6           No lifelines
      7           No lifelines
      8           No lifelines
      9           50:50 option
     10           Phone
     11           No lifelines
     12           No lifelines
     13           Stop
expected reward   2490.89

Table 6: Solution of Model 1

3.2 Model 2: reaching a question

In this section we address a different solution approach to our problem. Section 3.1 gave the optimal strategy for maximizing the expected reward, and how much we win by following it. Now we want to find the optimal strategy to follow in order to maximize the probability of reaching and answering correctly a given question, together with the probability of doing so under such a strategy.

Let us define the new problem. Recall that a state s is a four-dimensional vector, as before: s = (k, l_1, l_2, l_3). Let k̄ in {1, 2, ..., 15} be a fixed number; our goal is to answer question number k̄ correctly. We denote by f(s) the maximum probability of reaching and answering correctly question number k̄, starting from state s. We evaluate f(s) in the following way: the maximum probability of reaching and answering correctly question k̄ starting from state s is the maximum, over the actions a in A(s), of the probability of answering the current question correctly times the maximum probability of achieving our goal from the state t(s, a), where t(s, a) is, as before, the state reached when action a is chosen in state s and the answer is correct.

Then we have:

    f(k, l_1, l_2, l_3) = max { p_{k, g_1, g_2, g_3} f(k + 1, l_1 - g_1, l_2 - g_2, l_3 - g_3) : g_i in {0, ..., l_i}, g_i in Z, i = 1, 2, 3 },

where p_{k, g_1, g_2, g_3} is the probability of answering the k-th question correctly using the indicated lifelines; indeed, lifeline i is used if g_i = 1, i = 1, 2, 3.

The function f is a recursive functional, so to evaluate it by backward induction we need its value at all states of the terminal stage. Notice that the goal in this formulation is to reach stage k̄; thus the probability of having reached stage k̄ when we are already at stage k̄ + 1 is clearly 1. Hence we have

    f(k̄ + 1, l_1, l_2, l_3) = 1   for all l_i in {0, 1}, i = 1, 2, 3.

Once we have the value of the function at the terminal stage, the solution of this model is the value of f at the departing state. If we start from the first question with all the lifelines, the departing state is (1, 1, 1, 1); but if we start at the third question with only the 50:50 and audience lifelines left, the departing state would be (3, 1, 0, 1).

In any case, the algorithm we propose solves the problem starting from any departing state and with any level of the game as the goal. We use the estimated probabilities and the constants c_k^i, calculated as before, in a MAPLE computer program to evaluate the function f and to find optimal strategies. In this model we do not have a unique solution but fifteen, one for each possible goal: the fifteen questions of the game. For the sake of brevity, we only show the solutions obtained when we start in state (1, 1, 1, 1) and want to reach and answer correctly question number 5, 10, 13 or 15. The optimal strategies and the probabilities of reaching and answering these goals correctly are shown in table 7; the last row of the table gives the probability of achieving the corresponding goal.

Question index   Goal: 5        Goal: 10       Goal: 13       Goal: 15
 1               No lifelines   No lifelines   No lifelines   No lifelines
 2               No lifelines   No lifelines   No lifelines   No lifelines
 3               50:50          No lifelines   No lifelines   No lifelines
 4               Audience       No lifelines   No lifelines   No lifelines
 5               Phone          No lifelines   No lifelines   No lifelines
 6                              Audience       No lifelines   No lifelines
 7                              No lifelines   No lifelines   No lifelines
 8                              No lifelines   No lifelines   No lifelines
 9                              50:50          Audience       No lifelines
10                              Phone          No lifelines   No lifelines
11                                             Phone          Phone
12                                             50:50          No lifelines
13                                             No lifelines   No lifelines
14                                                            Audience
15                                                            50:50
Probability      0.85           0.12           0.01           0.001

Table 7: Optimal strategies for Model 2
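In code, Model 2 differs from the Model 1 sketch only in its boundary condition and in dropping the reward terms. A minimal version (ours), reusing State, feasible_actions, transition and success_prob from the earlier sketches, could read:

# Sketch of Model 2: maximum probability of reaching and answering
# question k_goal correctly, by memoised backward induction.
from functools import lru_cache

def make_reach_probability(k_goal: int, p: dict, c: dict):
    @lru_cache(maxsize=None)
    def f(s: State) -> float:
        if s.k == k_goal + 1:          # the goal question has been answered
            return 1.0
        best = 0.0                     # stopping never helps toward the goal
        for kind, g in feasible_actions(s):
            if kind == "answer":
                best = max(best,
                           success_prob(s.k, g, p, c) * f(transition(s, g)))
        return best
    return f

# Example: make_reach_probability(5, p, c)(State(1, 1, 1, 1)) should come
# close to the 0.85 of table 7 when p and c hold the full estimated tables.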

4 Further analysis of the game

We have solved the problem in a static way, because all the probabilities are determined a priori, that is, without knowledge of the actual questions. In reality, the probabilities of answering correctly change each time the player faces the current question. For example, at the fourth question, once she sees the actual question, the player can estimate her probability of answering it correctly based on her own knowledge of the topic. Then, changing that probability in our scheme, she evaluates the function again, starting from the current state and keeping the other probabilities unchanged. This means that at each stage k the player modifies the probability p_k of answering correctly according to her own knowledge of the subject, which would be a realistic, dynamic way to play the game. This feature has been incorporated into our computer code, so that at each stage the player can change the probability of answering the current question correctly. Notice that this does not modify our recursive analysis of the problem; it only means that we allow the probability p_k to change at each step of the analysis.

4.1 Simulation

In order to illustrate our analysis of this game, we perform a simulation of the process to check the behaviour of the winning strategies proposed by our models. As mentioned in Section 2, we classify the participants in four groups as follows:

Players in group A have the original probabilities described previously.

Probabilities of answering correctly in group B are the probabilities of group A multiplied by 0.9.

Probabilities of answering correctly in group C are the probabilities of group A multiplied by 0.8.

Probabilities of answering correctly in group D are the probabilities of group A multiplied by 0.7.

In the following, we present two tables (both collected in table 8) with the strategy that one participant of each group (A, B, C and D) should follow in order to maximize her expected reward (Model 1), and the strategy to maximize the probability of winning at least that maximum expected reward (Model 2). For example, the last row of the column for participant A under Model 1 shows the expected reward she would get by following the strategy described in that column, and the last row under Model 2 is the probability of winning at least that maximum expected reward.

To win at least 2490.9 Euros, one has to answer question number 7 correctly; the other cases are analogous. The last row of each table thus shows the maximum expected reward for Model 1, and the probability of success when following the strategy described under Model 2.

To finish this section, we show a simulation of Model 1 played in its dynamic version. That is, we assume that for each actual question the probability of answering correctly is modified once the concrete question is known. Suppose the contestant is now facing the k-th question: depending on the degree of difficulty of the actual question, she decides whether to answer it (and how) or not. The model assumes that the probabilities of answering the following questions correctly, that is from k + 1 on, are the original ones estimated before. In table 9 the strategies using the 50:50, Phone and Audience lifelines are denoted by 50, P and A, respectively, and NL stands for answering with no lifelines. In order to simplify the simulation, we assume that the probability of answering correctly can take the following values:

1 if the contestant knows the right answer.

0.5 if the contestant doubts between two answers.

0.33 if she is only sure that one of the answers is incorrect.

0.25 if she knows nothing about the answer and all four seem possible to her.

The reader may notice that any kind of a priori probabilistic information, based on the knowledge of the actual player, can be incorporated into the model; this is done by computing posterior probabilities using Bayes' rule. It is clear that the strategies change depending on the probabilities of the questions the contestant faces, which have been chosen at random using a different probability function for each question number. The first number in each cell of table 9 is the realized probability of answering the corresponding question correctly. As can be seen, depending on the simulated probabilities, the strategies vary from stopping at the fifth question up to stopping at the twelfth.
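In code terms, this dynamic variant amounts to overriding one entry of the probability table and re-solving Model 1 from the current state. A sketch, again with our own illustrative names and reusing make_value_function, r and State from above:

# Dynamic play (sketch): the contestant replaces p_k for the current
# question by her subjective estimate (1, 0.5, 0.33 or 0.25) and re-solves;
# future questions keep the originally estimated probabilities.
def dynamic_decision(s: State, own_prob: float, p: dict, c: dict) -> str:
    p_now = dict(p)
    p_now[s.k] = 100 * own_prob        # subjective estimate, in percent
    f = make_value_function(p_now, c)  # re-run the Model 1 recursion
    # play on only if some answering action beats the sure reward r_{k-1}
    return "play" if f(s) > r[s.k - 1] else "stop"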

                    Skill A                      Skill B
Question      Model 1     Model 2          Model 1     Model 2
  1           Without     Without          Without     Without
  2           Without     Without          Without     Without
  3           Without     Without          Without     50:50
  4           Without     Without          Audience    Audience
  5           Without     Phone            Phone       Phone
  6           Audience    50:50            Without     Stop
  7           Without     Audience         Without
  8           Without     Stop             Without
  9           50:50                        Without
 10           Phone                        50:50
 11           Without                      Without
 12           Without                      Without
 13           Stop                         Stop
 14
 15
E.R. / Prob   2490.9      0.622            1289.4      0.557

                    Skill C                      Skill D
Question      Model 1     Model 2          Model 1     Model 2
  1           Without     Without          Without     Without
  2           Without     Audience         Without     Without
  3           50:50       50:50            50:50       50:50
  4           Audience    Phone            Audience    Audience
  5           Phone       Stop             Phone       Phone
  6           Without                      Without     Stop
  7           Without                      Without
  8           Without                      Stop
  9           Stop
 10
 11
 12
 13
 14
 15
E.R. / Prob   747.5       0.482            421.1       0.475

Table 8: Optimal solutions depending on the player's skill

Question   P1          P2          P3          P4          P5          P6
  1        1/NL        1/NL        0.5/50-A    0.5/50-A    0.5/50-A    1/NL
  2        0.5/50      0.5/P       1/NL        0.33/P      1/NL        1/NL
  3        1/NL        0.33/A      0.5/P       1/NL        1/NL        0.33/50
  4        1/NL        0.5/50      0.5/NL      1/NL        0.5/P       1/NL
  5        0.5/P       0.25/Stop   0.5/NL      0.33/NL     0.5/NL      1/NL
  6        0.5/A                   0.33/NL     0.5/NL      1/NL        0.5/A
  7        0.5/NL                  1/NL        0.5/NL      0.33/NL     1/NL
  8        1/NL                    0.5/NL      0.5/NL      1/NL        0.5/NL
  9        0.33/Stop               0.33/Stop   0.33/Stop   0.25/Stop   1/NL
 10                                                                    0.25/P
 11                                                                    0.25/NL
 12                                                                    0.25/Stop

Table 9: Simulation
