MAS113 Introduction to Probability and Statistics

1 Introduction

1.1 Studying probability theory

There are (at least) two ways to think about the study of probability theory:

1. Probability theory is a branch of Pure Mathematics. We start with some axioms and investigate the implications of those axioms; what results are we able to prove? Probability theory is a subject in its own right, which we can study (and appreciate!) without concerning ourselves with any possible applications. We can also observe links to other areas of mathematics.

2. Probability theory gives us a mathematical model for explaining chance phenomena that we observe in the real world. As with any mathematical theory, it may not describe reality perfectly, but it can still be extremely useful. We must think carefully about when and how the theory can be applied in practice.

We will be studying probability from both perspectives. Applications of probability theory are important and interesting, but as a mathematician, you should also develop an understanding and an appreciation of the theory in its own right.
1.2 Motivation

We are often faced with making decisions in the presence of uncertainty.

A doctor is treating a patient, and is considering whether to prescribe a particular drug. How likely is it that the drug will cure the patient?

A probation board is considering whether to release a prisoner early on parole. How likely is the prisoner to re-offend?

A bank is considering whether to approve a loan. How likely is it that the loan will be repaid?

A government is considering a CO2 emissions target. What would the effect of a 20% cut in emissions be on global mean temperatures in 20 years' time?

We often use verbal expressions of uncertainty (possible, quite unlikely, very likely), but these are often inadequate if we want to communicate with each other about uncertainty. Consider the following example.

You are being screened for a particular disease. You are told that the disease is quite rare, and that the screening test is accurate but not perfect; if you have the disease, the test will almost certainly detect it, but if you don't have the disease, there is a small chance the test will mistakenly report that you have it anyway. The test result is positive. How certain are you that you really have the disease?

Clearly, in this example and in many others, it would be useful if we could quantify our uncertainty. In other words, we would like to measure how likely
it is that something will happen, or how likely it is that some statement about the world turns out to be true. In this course we introduce a theory for measuring uncertainty: probability theory. We have two ingredients: set theory, which we will use to describe what sort of things we want to measure our uncertainty about, and, appropriately, measure theory, which sets out some basic rules for how to measure things in a sensible way.

1.3 Motivation - the need for set theory and measures

If you have studied probability at GCSE or A-level, you may have seen a definition of probability like this:

Suppose all the outcomes in an experiment are equally likely. The probability of an event A is defined to be

P(A) := (number of outcomes in which A occurs) / (total number of possible outcomes).   (1)

Example 1. A deck of 52 playing cards is shuffled thoroughly. What is the probability that the top card is an ace?

The definition can work for examples such as this (with a finite number of equally likely outcomes), but we soon run into difficulties:

Example 2. Imagine generating a random angle between 0 and 2π radians, where any angle is equally likely. (For example, consider spinning a roulette wheel, and measuring the angle between the horizontal axis (viewing the wheel from above) and a line from the centre of the wheel that bisects the zero on the wheel.) What is the probability that the angle is between 0 and 2 radians?
You might guess that the answer should be 2/(2π), but the definition in (1) causes us problems! There are infinitely many possible angles that could be generated, and there are infinitely many angles between 0 and 2 radians. Clearly, we can't divide infinity by infinity and come up with 2/(2π).

It may not make much sense to think of outcomes as equally likely:

Example 3. What is the probability that a team from Manchester will win the Premier League this season?

We will resolve this by constructing a more general definition of probability, using the tools of set theory and a special type of function called a measure, and we'll show how choosing different types of measure will give us sensible answers in both the above examples.

2 Set theory

Set theory is covered in more detail in MAS110, but we will review the main points here.

2.1 Sets

Definition 1. A set is a collection of objects, with each object referred to as an element or a member of the set.

Some examples:
1. S₁, a set of some colours: S₁ = {yellow, red, blue, green, purple}.

2. N, the natural numbers: N = {1, 2, 3, 4, ...}.

We say that S₁ is a finite set, as it has 5 members, and N is an infinite set, as there are infinitely many natural numbers. We use the notation

red ∈ S₁

to mean that red is a member or element of the set S₁. You should be able to guess what is meant by 1/2 ∉ N.

We represent some sets using interval notation.

Definition 2. We define open, closed, and half-open intervals as follows.

open interval: (a, b) := {x ∈ R; a < x < b},
closed interval: [a, b] := {x ∈ R; a ≤ x ≤ b},
half-open interval: (a, b] := {x ∈ R; a < x ≤ b}, or [a, b) := {x ∈ R; a ≤ x < b},

where you should read the semicolon to mean "such that". Note that we may write, for example, (a, ∞), but we do not write (a, ∞].

Definition 3. For any two sets A and B we say that A is a subset of B if every element of A is also an element of B, and we write A ⊆ B.
This does not rule out the possibility that A and B are the same set. If there is at least one element of B that is not in A, we write A ⊂ B. In this case, we say that A is a proper subset of B.

For example, if R = {2, 4, 6, 8, 10, 12, ...} is the set of all even numbers, then R ⊂ N.

Note the difference between ∈ and ⊆. If S = {1, 2, 3}, we write 1 ∈ S, {1} ⊆ S.

Definition 4. We define the empty set to be the set that does not contain any elements. We denote the empty set by ∅.

2.2 Operations on sets: unions, intersections, complements, and set differences

We start with a universal set S, which lists all the elements we wish to consider for the situation at hand. (Note that, in spite of the word universal, the choice of S depends on context.) We can then perform operations on subsets of S.

Definition 5. The union of two sets A and B, written A ∪ B, is the set of all elements that are either in set A, or in set B, or in both. We write

A ∪ B := {x ∈ S; x ∈ A or x ∈ B}.
We can visualise this using a Venn diagram (see Figure 1). The rectangle represents the universal set, the two circles represent the subsets A and B, and the shaded area represents the set A ∪ B.

Figure 1: A Venn diagram, with the shaded region showing A ∪ B.

Definition 6. The intersection of two sets A and B, written A ∩ B, is the set of all elements that are both in the set A and in the set B. We write

A ∩ B := {x ∈ S; x ∈ A and x ∈ B}.

Figure 2: A Venn diagram, with the shaded region showing A ∩ B.
Definition 7. If there are no elements in S that are both in A and in B, we say that A and B are disjoint, and we write A ∩ B = ∅.

On a Venn diagram, you can think of disjoint sets as non-overlapping.

Figure 3: A Venn diagram with disjoint A and B.

Definition 8. The complement of a set A, written as Ā, is the set of all elements that are in S, but not in A. We write

Ā = {x ∈ S; x ∉ A}.

Alternative notation: Ā is also written as Aᶜ and A′. Note that the complement of ∅ is S.
Figure 4: A Venn diagram, with the shaded region showing Ā.

Definition 9. The set difference, written as A \ B, is the set of all elements that are in A, but are not in B. We write

A \ B := {x ∈ S; x ∈ A and x ∉ B}.

Alternative notation: A \ B is also written as A − B.

Figure 5: A Venn diagram, with the shaded region showing A \ B.

Note that A \ B = A ∩ B̄.

Example 4. Suppose we have

S = [0, 10] = {x ∈ R; 0 ≤ x ≤ 10},
A = (1, 5) = {x ∈ R; 1 < x < 5},
B = [3, 7] = {x ∈ R; 3 ≤ x ≤ 7}.
What are A ∪ B, A ∩ B, Ā and A \ B?

Theorem 1. (De Morgan's Laws) For any two sets A and B,

(A ∪ B)ᶜ = Ā ∩ B̄,
(A ∩ B)ᶜ = Ā ∪ B̄.

You will prove these results as an exercise in MAS110.

3 Set theory and probability

We now discuss the role of set theory in probability. We consider uncertainty in the context of an experiment, where we use the word experiment in a loose sense to mean observing something in the future, or discovering the true status of something that we are currently uncertain about.

In an experiment, suppose we want to consider how likely some particular outcome is. We might start by considering what all the possible outcomes of the experiment are. We can use a set to list all the possible outcomes.

Definition 10. A sample space is a set which lists all the possible outcomes of an experiment. In the context of probability, the sample space will play the role of the universal set S.

Example 5. Examples of sample spaces

Definition 11. An event is a subset of a sample space. If the outcome of the experiment is a member of the event, we say that the event has occurred.

Example 6. Examples of events
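The set operations above, and De Morgan's laws, can be checked directly on small finite sets. The following sketch uses Python's built-in set type; the particular sets chosen are illustrative only.

```python
# A small finite universal set and two subsets, chosen only for illustration.
S = set(range(1, 11))          # universal set {1, ..., 10}
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

# Union, intersection, complement and set difference.
union = A | B                  # A ∪ B
intersection = A & B           # A ∩ B
complement_A = S - A           # Ā, the elements of S not in A
difference = A - B             # A \ B

print(union)                   # {1, 2, 3, 4, 5, 6}
print(intersection)            # {3, 4}
print(difference)              # {1, 2}

# De Morgan's laws: the complement of a union is the intersection of
# the complements, and the complement of an intersection is the union
# of the complements.
assert S - (A | B) == (S - A) & (S - B)
assert S - (A & B) == (S - A) | (S - B)
```

Note that Python's `-` operator on sets is exactly the set difference A \ B, so the complement Ā is simply S − A, matching the identity A \ B = A ∩ B̄.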
Given a sample space S and two subsets/events A and B, the set operations union, intersection, complement and difference all define further events:

Set notation   The event in words
A ∪ B          either A occurs or B occurs, or both occur
A ∩ B          both A and B occur
Ā              A does not occur
A \ B          A occurs, but B does not

Example 7. Combinations of events

3.1 The difference between an outcome and an event

Let's clarify the difference between an outcome and an event. An outcome is an element of the sample space S, whereas an event is a subset of S. Consider the following example.

Suppose Spain play Germany at football this evening. In a simple approach, the sample space is

S = {Spain win, Germany win, draw}.

Before the match, you can bet on an event, e.g. "either Spain win or draw". Tomorrow, a news reporter will tell you the outcome, e.g. "Spain won". The news reporter would not report the event "either Spain won or the match was drawn last night".
If the observed outcome belongs to the event "either Spain win or draw", the event has occurred, and you have won your bet. An event could be a single outcome (you could bet on "Spain win"). For a sample space S, outcome a and event A, we would use the notation a ∈ S and A ⊆ S. Make sure you understand when to use the symbol ∈ and when to use the symbol ⊆.

4 Measure

4.1 Set functions

We can define set functions that take sets as their inputs and produce real numbers as outputs.

Not all set functions make sense for all possible sets, so when defining a set function, we'll first say what the sample space is, and then define set functions that operate on subsets of a sample space.

Example 8. Examples of set functions
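Example 8 is left as a gap in the notes; the sketch below gives two plausible illustrations of set functions (these particular choices are my own, not taken from the notes). Each takes a subset of the sample space as input and returns a real number.

```python
# Two set functions on subsets of a finite sample space.
# (Illustrative choices only; Example 8 is left blank in the notes.)
S = {1, 2, 3, 4, 5, 6}

def f1(A):
    """Set function: the number of elements in A."""
    return len(A)

def f2(A):
    """Set function: the largest element of A.
    Note this does not make sense for the empty set, illustrating that
    not all set functions are defined for all possible sets."""
    return max(A)

print(f1({2, 4, 6}))   # 3
print(f2({2, 4, 6}))   # 6
```

The first of these will reappear shortly: counting the elements of a set turns out to be a measure, while a function like `f2` is a perfectly good set function but not a measure.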
4.2 Measures

We're close to being able to give a formal definition of probability: we're going to define probability as a special type of set function that we apply to a set. We first define a measure as a special type of set function.

Definition 12. Given a sample space (or universal set) S, a (finite) measure is a set function that assigns a non-negative real number m(A) to every set A ⊆ S, and which satisfies the following condition: for any two disjoint subsets A and B of S (i.e. A ∩ B = ∅),

m(A ∪ B) = m(A) + m(B).   (2)

Note: This is a simplified definition of measure which is sufficient for this module. There are further technical conditions in the full definition of measure, which we don't need to concern ourselves with here. For more details, see the supplementary online notes.

Informally, the above definition can be put into words as follows. A measure is something which assigns a number to a set, produces a non-negative number as its output, and behaves "additively": if we take two non-overlapping (disjoint) sets, we can either combine the sets and then apply the measure, or apply the measure to each set and then sum the results, and we will get the same value either way.

We can extend the condition in (2) to finite unions of disjoint sets. If three sets A, B and C are all mutually disjoint (A ∩ B = A ∩ C = B ∩ C = ∅), and if
we define D = B ∪ C, then

m(A ∪ B ∪ C) = m(A ∪ D)
             = m(A) + m(D)
             = m(A) + m(B ∪ C)
             = m(A) + m(B) + m(C).

Example 9. Examples of measures

Theorem 2. (Properties of measures) Let m be a measure on some set S, with A, B ⊆ S.

(M1) If B ⊆ A, then m(A \ B) = m(A) − m(B).

(M2) If B ⊆ A, then m(B) ≤ m(A).

(M3) m(∅) = 0.

(M4) m(A ∪ B) = m(A) + m(B) − m(A ∩ B).

(M5) For any constant c > 0, the function g(A) = c·m(A) is also a measure.

(M6) If m and n are both measures, the function h(A) = m(A) + n(A) is also a measure.

4.3 Measures on partitions of sets

Definition 13. A partition of a set S is a collection of sets E = {E₁, E₂, ..., Eₙ} such that
1. Eᵢ ∩ Eⱼ = ∅, for all 1 ≤ i, j ≤ n with i ≠ j;

2. E₁ ∪ E₂ ∪ ... ∪ Eₙ = S.

We say that {E₁, E₂, ..., Eₙ} are mutually exclusive and exhaustive. If S is a sample space, one of the events in E must occur, but no two events can occur simultaneously.

Figure 6: An example of a partition with 5 sets.

From (2), we have

m(S) = m(E₁) + m(E₂) + ... + m(Eₙ).

5 Probability as measure

Definition 14. Suppose S is a sample space and P a measure on S. We say that P is a probability (or probability measure) if P(S) = 1.

Hence probability is defined to be a special case of measure: a measure that takes the value 1 on a sample space S.
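The counting measure m(A) = |A| (the number of elements of A) is the simplest example of a measure on a finite set, and it can be used to check the additivity condition (2), the properties (M1)-(M4), and the partition identity numerically. A sketch, with illustrative sets of my own choosing:

```python
# The counting measure m(A) = |A| on subsets of a finite set S.
S = set(range(10))

def m(A):
    """Counting measure: the number of elements of A."""
    return len(A)

A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
C = {2, 3}                                  # C is a subset of A

# Additivity (2): {0, 1} and {8, 9} are disjoint.
assert m({0, 1} | {8, 9}) == m({0, 1}) + m({8, 9})

assert m(A - C) == m(A) - m(C)              # (M1), since C ⊆ A
assert m(C) <= m(A)                         # (M2)
assert m(set()) == 0                        # (M3)
assert m(A | B) == m(A) + m(B) - m(A & B)   # (M4), inclusion-exclusion

# A partition of S into mutually exclusive, exhaustive pieces:
# the measure of S is the sum of the measures of the pieces.
E1, E2, E3 = {0, 1, 2}, {3, 4, 5, 6}, {7, 8, 9}
assert m(S) == m(E1) + m(E2) + m(E3)
```

The same checks would pass for any measure; the counting measure is just the easiest one to compute with.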
To summarise, using the concept of measure, we have defined probability as follows.

We start with a set S of all possible outcomes of the experiment. We call S the sample space.

Probability is a function, a measure, that can be applied to a subset of S (including S itself).

A probability must be non-negative: we must have P(A) ≥ 0 for any set A ⊆ S.

An event that is certain to happen must have probability 1, so that P(S) = 1.

If two events A and B cannot both occur (so that A ∩ B = ∅), then the probability that either A occurs or B occurs is the sum of their two probabilities: P(A ∪ B) = P(A) + P(B).

This is as far as the mathematical definition of probability goes. There is no general definition to say what value P(A) must take. Rather, in any practical setting, we will choose a particular probability measure that we feel gives a suitable model for the problem at hand. To see this in action, we return to Examples 1 and 2 (where we initially struggled to come up with a suitable definition of probability). Theorem 2 (M5) says that we can multiply or divide a measure by a positive constant to get another valid measure. We will now use the measures f₁ and f₂ given in Example 9, divided by suitable constants.

Example 10. Examples of probability measures
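Examples 9 and 10 are left as gaps in the notes. A natural reading, assumed in the sketch below (it is my assumption, not stated in the notes), is that f₁ is the counting measure and f₂ is length (angle) measure; dividing each by a suitable constant then gives probability measures that answer Examples 1 and 2.

```python
from fractions import Fraction
from math import pi

# Example 1: the counting measure divided by |S| = 52 gives a
# probability measure on subsets of the deck.
def P_cards(A, size_S=52):
    return Fraction(len(A), size_S)

aces = {"ace of clubs", "ace of diamonds", "ace of hearts", "ace of spades"}
print(P_cards(aces))          # 1/13, i.e. 4/52

# Example 2: length measure on intervals of angles, divided by the
# total length 2π, so the probability of the interval (a, b) is
# (b - a)/(2π).
def P_angle(a, b):
    return (b - a) / (2 * pi)

print(P_angle(0, 2))          # 2/(2π) ≈ 0.318
```

In both cases P(S) = 1 by construction: the whole deck has counting measure 52, and the full circle of angles has length 2π.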
5.1 Properties of probabilities

We can apply what we know about measures to deduce various properties of probabilities. For a sample space S, and A, B ⊆ S:

(P1) P(S) = 1.

(P2) 0 ≤ P(A) ≤ 1.

(P3) P(∅) = 0.

(P4) P(Ā) = 1 − P(A).

(P5) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

(P6) If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

(P7) If B ⊆ A, then P(A \ B) = P(A) − P(B) and P(B) ≤ P(A).

(P8) If E = {E₁, E₂, ..., Eₙ} is a partition of S, then P(S) = P(E₁) + P(E₂) + ... + P(Eₙ).

You should verify that each of these properties holds, by referring back to the properties of measures discussed earlier.

5.2 The law of large numbers: bridging the gap from theory to observation

In Example 1, we said that the probability of the top card being an ace was 4/52. In one sense, all this statement is saying is that there are 52 possible outcomes in total, and for 4 of these outcomes, the event of the top card being an ace will occur. But does this probability value of 4/52 tell us anything
about what will actually happen if we do the experiment? Not really. We may see an ace, we may not, and it's not clear how we would relate what we would see to the value of 4/52.

However, there is a result in probability theory that tells us what will actually happen over a large number of experiments. This result is the law of large numbers. We will study this result later on in this module (and state it more precisely), but for now, we'll just give the following informal version.

The law of large numbers (an informal version). Suppose we do a sequence of independent experiments, so that the outcome in one experiment has no effect on the outcome in another experiment. Now suppose in experiment i, there is an event Eᵢ that has a probability of p of occurring, for i = 1, 2, .... The proportion of events out of E₁, E₂, ... that actually occur as we do the experiments will converge to p, as the number of experiments increases.

The sequence of experiments could correspond to repeating the same activity lots of times under the same conditions: if we repeat a large number of times the process of shuffling a deck of cards and drawing the top card, we should expect to see the top card being an ace approximately 8% (≈ 4/52) of the time.

Or the sequence of experiments could correspond to different activities: let E₁ be the event of tossing a coin and observing a head, E₂ be the event of rolling a die and observing an even number, E₃ be the event of drawing a red playing card from a standard deck, E₄ be the event that a randomly selected newborn baby is a boy, and so on. Given a sufficiently long list of such events E₁, E₂, ..., each with probability 0.5, we should see approximately half these events occur if we do the experiments.
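The card-shuffling illustration above is easy to simulate, which makes the law of large numbers visible: over many repetitions, the observed proportion of aces settles close to 4/52. A simulation sketch (the number of trials and the random seed are arbitrary choices):

```python
import random

# Repeatedly shuffle a 52-card deck and record whether the top card
# is an ace; the proportion of "hits" should approach 4/52 ≈ 0.077.
random.seed(0)
deck = list(range(52))       # cards 0-3 represent the four aces
n_trials = 100_000
hits = 0
for _ in range(n_trials):
    random.shuffle(deck)
    if deck[0] < 4:          # top card is an ace
        hits += 1

proportion = hits / n_trials
print(proportion)            # close to 4/52 ≈ 0.077
```

No single trial is predictable, but the long-run proportion is: exactly the "taming of uncertainty" described in the text.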
This is a profound result for taming uncertainty! Although we can't predict what will happen in any one instance, we can make useful predictions about what will happen overall. Suppose, for the same illness, the probability of a patient being cured by drug A is 0.1, and the probability of a patient being cured by drug B is 0.2. We cannot be certain whether any single patient will be cured by either drug. But we can predict confidently that in a large population of patients, about 10% will be cured by drug A, and about 20% will be cured by drug B.

There remains a rather difficult philosophical question of whether a single experiment in the real world actually has a "true" probability attached to it, but we won't concern ourselves with this! We'll simply take the position that if we can justify our choice of probability values for some experiments in the real world, we can make useful predictions about what will happen in the world, and these predictions can be tested.

6 Assigning probabilities

We have used measure theory to establish some basic rules that probabilities must follow, and the law of large numbers tells us what we should observe given a sequence of independent events, each with the same probability of occurring. But for modelling any experiment in the real world, there is no theorem to tell us what the correct probabilities are! (Or even whether there are correct probabilities!) There are three main methods for assigning probabilities: the classical approach based on symmetry, relative frequencies, and subjective judgements. Arguably, the first two methods are special cases of the third, because they also involve making subjective judgements.
6.1 Classical probability: assuming symmetry

In the case of a finite sample space, in the classical approach we make the assumption that each outcome in the sample space is equally likely. If there are k outcomes in the sample space, then the probability of any single outcome is 1/k. The probability of any event is then calculated as the number of outcomes in which the event occurs, divided by the total number of possible outcomes:

P(A) = |A| / |S|.

In effect, we have chosen P to be the counting measure, divided by the number of elements in the sample space. This is the approach that we used in Example 1. One situation in which we are usually willing to assume equally likely outcomes is in gambling and games of chance.

Example 11. On a European roulette wheel, the ball can land on one of the integers 0 to 36. A bet of one pound on "odd" returns one pound (plus the original stake) if the ball lands on any odd number from 1 to 35. Suppose you bet one pound a large number N of times. According to the law of large numbers, who would you expect to make a profit: you or the casino?

6.2 Counting methods for classical probability

The sample space may be very large, so writing out all the possible elements and counting directly may be laborious. Often, we can apply some standard results for permutations and combinations, together with a little logic.

Definition 15. Let nPr denote the number of permutations, that is, the number of ways of choosing r elements out of n, where the order matters.
Then

nPr = n! / (n − r)!.

Let (n choose r) (also written nCr) denote the number of combinations, that is, the number of ways of choosing r elements out of n, where the order does not matter. Then

(n choose r) = n! / (r!(n − r)!).

These results are proved in MAS110. For reference, more detail is given in the supplementary online notes.

Example 12. Consider the National Lottery (under its new rules). On a single ticket, you must choose 6 integers between 1 and 59. A machine also selects 6 integers between 1 and 59, and it is reasonable to assume that each number is equally likely to be chosen. If I buy one ticket:

1. What is the probability that I match all 6 numbers?

2. What is the probability I match 3 numbers only?

6.3 Classical probability under random sampling

We can design experiments in which equally likely outcomes can be assumed. We start with a population of people or items, and then pick a subset of the population, such that each member of the population is equally likely to be selected. This is called (simple) random sampling. (Some care is needed to make sure we are satisfied that each member really does have the same chance of being selected.) This concept is important in Statistics, where we want to make inferences about a population based on what we have observed
in a random sample. We can use probability theory to understand how the characteristics of a random sample may differ from the characteristics of a population.

Example 13. A factory has produced 50 items, some of which may be faulty. If an item is tested for a fault, it can no longer be used (testing is "destructive"). To estimate the proportion of faulty items, a sample of 5 items is selected at random for testing. Suppose out of the 50 items, 10 are faulty.

1. What is the probability that none of the 5 items in the sample are faulty?

2. What is the probability that the proportion of faulty items in the sample is the same as the proportion of faulty items in the population?

6.4 Relative frequencies

In some situations, we may judge that outcomes are not equally likely, so that the classical approach is not appropriate. However, we may be able to repeat the experiment, and under the assumption that the conditions of the experiment are the same for each repetition, we may assign a probability to an event A as follows.

1. Repeat the experiment a large number of times.

2. Set P(A) to be the proportion of experiments in which A occurs.

Here, we are essentially appealing to the law of large numbers. The law of large numbers says that if in each experiment P(A) is the same value p, then the proportion of experiments in which A occurs will converge to p as the number of experiments tends to infinity. Hence, in practice, if we are satisfied
that the conditions of the experiment are the same each time, estimating P(A) by the proportion of occurrences in a finite number of experiments seems reasonable. In fact, there is a more formal theory of how to choose a probability based on a finite number of repetitions of the experiment. This is taught in a 3rd year module, Bayesian Statistics.

Example 14. I have a backgammon game on my phone. I have played the computer in 73 matches, and have won 50 of them. Assuming my ability hasn't changed over this time, I may judge P({I win}) = 50/73 and P({I lose}) = 23/73 for a single match. If, for completeness, I also state that P(∅) = 0 and P(S) = 1, where S = {I win, I lose}, have I specified a valid probability measure?

6.5 Subjective probabilities and fair odds

In some situations, we will not believe that the outcomes are equally likely, and the experiment is either not repeatable, or no past observations of the experiment exist. An example would be observing which party gets the largest number of seats at the next UK election. If we write the sample space as

S = {Conservative, Labour, Liberal Democrats, Other},

no-one with any knowledge of UK politics would judge the Liberal Democrats to have the same chance as Conservative or Labour, and we argue that previous elections were held under different circumstances, and so cannot be thought of as replications of the same experiment. Nevertheless, you can still choose your own probability values, based on your own opinion. Bookmakers do this regularly when setting their opening odds, and generally they are very good at it. In fact, the concept of fair odds can be a helpful way to decide what your probability should be.
Definition 16. For an event E, the odds on the event E are

P(E) / (1 − P(E)),

and the odds against the event E are

(1 − P(E)) / P(E),

so the odds against an event are 1 over the odds on an event.

Odds may be quoted as ratios: if we say the odds on E are "three to two" (on), we mean the fraction P(E)/(1 − P(E)) is equal to 3/2. The odds may be written in the form 3/2.

Bookmakers usually quote odds against an event. If you bet £1 on the event E to occur at odds of x to y against (written as x/y), then if E occurs, you get your £1 back, plus a further £x/y. If you bet £1 on England winning the next football World Cup at odds of 19 to 1 against, then if England win, you get your £1 back, plus a further £19.

Definition 17. Informally, we define fair odds to mean odds that you think favour neither the bookmaker nor the gambler. If you think the odds favour the gambler, you'd expect the bookmaker to make a loss. If you think the odds favour the bookmaker, you'd expect the bookmaker to make a profit.

Example 15. Suppose I roll a fair six-sided die, and offer you odds of 3 to 1 against the outcome being a six. Do you think these odds are fair? If not, what would the fair odds be?

Example 16. On a European roulette wheel (numbers from 0 to 36 inclusive, each number appears once), a casino offers odds of 2 to 1 against the result being from 1 to 12 inclusive. What probability of getting 1-12 would this imply? What is the classical probability?
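From Definition 16, odds of x to y against an event E mean (1 − P(E))/P(E) = x/y, which rearranges to P(E) = y/(x + y). A sketch of this conversion, applied to Examples 15 and 16 (the interpretations in the comments are one way to work through those examples, not the notes' official solutions):

```python
from fractions import Fraction

# Odds of x to y against an event E imply P(E) = y / (x + y).
def prob_from_odds_against(x, y):
    return Fraction(y, x + y)

# Example 15: the offered odds of 3 to 1 against a six imply
# probability 1/4, but the classical probability of a six is 1/6,
# corresponding to fair odds of 5 to 1 against.
print(prob_from_odds_against(3, 1))    # 1/4
print(prob_from_odds_against(5, 1))    # 1/6

# Example 16: odds of 2 to 1 against landing on 1-12 imply
# probability 1/3, while the classical probability is 12/37.
print(prob_from_odds_against(2, 1))    # 1/3
print(Fraction(12, 37))                # slightly less than 1/3
```

Note that in both examples the quoted odds imply a probability larger than the classical one: the kind of margin in the bookmaker's favour discussed in the text.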
You can decide on your fair odds for any event. Having chosen fair odds against an event E of x to y, the implied probability is

P(E) = 1 / (1 + x/y) = y / (x + y).

Different people, depending on their knowledge, will judge different odds to be fair, and so will have different probabilities. Does this result in a valid probability measure? Under a rather more precise definition of fair odds, the answer is yes. Further details are given in the supplementary online notes.

You can look at a bookmaker's quoted odds to get an indication of their subjective probability for an event in question, though their actual probabilities will be smaller, for reasons that we will explain later in the course. (If you calculate the probabilities for all the possible outcomes from a bookmaker's odds, the probabilities will sum to more than 1.) A bookmaker's odds will also be influenced by how popular the bets are. For example, small odds against an event may be (partly) the result of lots of people betting on the event, and the bookmaker lowering the odds to protect against potential losses.

At the time of writing (August 2015), here are some quoted odds from various bookmakers for some events.

Event                                              Odds    Implied probability
Lewis Hamilton wins 2015 F1 championship           1/6     0.86
Chelsea win 2015/16 Premier League                 3/1     0.25
Sheffield Wednesday promoted in 2015/16            13/2    0.13
Boris Johnson next Tory leader                     4/1     0.20
Jessica Ennis-Hill to win BBC Sports Personality   5/2     0.29
Ireland to win next Cricket World Cup              100/1   0.01
Hillary Clinton is the next US president           1/1     0.50

Although it is impossible to state whether any single subjective probability is "correct", we can still evaluate the performance of subjective probability
assessments in the long run. For example, some weather forecasters provide subjective probabilities, for example probabilities that it will rain the next day. If we look at all the occasions where the forecaster has said there is a 40% chance of rain tomorrow, we would hope to see that there was rain the following day on (approximately) 40% of these occasions.

7 Conditional probability

7.1 Introduction and definition

Conditional probability is an important concept that we can use to change a measurement of uncertainty as our information changes.

Example 17. For a randomly selected individual, suppose the probabilities of the four blood types are P(type O) = 0.45, P(type A) = 0.4, P(type B) = 0.1 and P(type AB) = 0.05. A test is taken to determine the blood type, but the test is only able to declare that the blood type is either A or B. What is the probability that the blood type is A?

Definition 18. We define P(E | F) to be the conditional probability of E given F, where

P(E | F) := P(E ∩ F) / P(F),   (3)

assuming P(F) > 0.

We can interpret this to mean "If it is known that F has occurred, what is the probability that E has also occurred?" If we know that the outcome belongs to the set F, then for E to occur also, the outcome must lie in the intersection E ∩ F. To get the conditional probability
of E F, we measure (using the probability measure P ) the fraction of the set F that is also in the set E. A comment on joint probabilitites Definition 18 gives us an intuitive way to think about joint probabilities P (E F ). Rearranging ((3)) we have and we can swap E and F and write P (E F ) = P (F )P (E F ), P (F E) = P (E F ) = P (E)P (F E). This means we can calculate the probability that both E and F occur by considering either 1. the probability that E occurs, and then the probability that F occurs given that E has occurred, or: 2. the probability that F occurs, and then the probability that E occurs given that F has occurred. Example 18. Visualising blood group example Specifying conditional probabilities directly Equation (3) tells us how to calculate P (E F ) if we already know P (E F ) and P (F ). But in some situations, we may be able to specify P (E F ) directly, given the information at hand. Example 19. A diagnostic test has been developed for a particular disease. In a group of patients known to be carrying the disease, the test successfully detected the disease for 95% of the patients. An individual is selected at 27
random from the population (and so may or may not be carrying the disease). Let D be the event that the individual has the disease, and T be the event that the test declares the individual has the disease. Using the frequency approach to specifying a probability and the information above, which of the following probabilities can we specify?

P(D)   P(D ∩ T)   P(T | D)   P(D | T)

In some cases, the conditional probability will be obvious, and you should have the confidence just to write it down!

Example 20.

1. A playing card is drawn at random from a standard deck of 52. Let A be the event that the card is a heart, and B be the event that the card is red. What are P(A | B) and P(B | A)?

2. On a National Lottery ticket (6 numbers chosen out of 59), let A be the event that the 6th number matches the 6th number drawn, and B be the event that the first 5 numbers on the ticket match the first 5 numbers drawn. What is P(A | B)?

7.2 Independence

Definition 19. Two events E and F are said to be independent if

P(E ∩ F) = P(E)P(F).
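Definition 19 can be checked by direct enumeration when the sample space is finite and the outcomes are equally likely. The sketch below uses two rolls of a fair die with a pair of events of my own choosing (not one of the notes' examples): the two dice behave independently, but "first die shows a six" and "total is at least 10" do not.

```python
from fractions import Fraction
from itertools import product

# The 36 equally likely outcomes of rolling two fair dice, with the
# classical probability P(A) = |A|/|S|.
S = list(product(range(1, 7), range(1, 7)))

def P(event):
    return Fraction(len([s for s in S if event(s)]), len(S))

E = lambda s: s[0] == 6            # first die shows a six
F = lambda s: s[1] == 6            # second die shows a six
G = lambda s: s[0] + s[1] >= 10    # total is at least 10

EF = lambda s: E(s) and F(s)
EG = lambda s: E(s) and G(s)

print(P(EF) == P(E) * P(F))   # True: E and F are independent
print(P(EG) == P(E) * P(G))   # False: E and G are not independent
print(P(EG), P(E) * P(G))     # 1/12 versus 1/36
```

Learning that the total is at least 10 makes a six on the first die more likely, which is exactly what the failure of the product rule records.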
We can use the definition of conditional probability to give a more intuitive characterisation of independence. The events E and F are independent if

P(E | F) = P(E).   (4)

(If (4) holds, then P(E ∩ F) = P(E | F)P(F) = P(E)P(F).) We can read this to mean "If E and F are independent, then learning that F has occurred does not change the probability that E will occur (and vice versa)."

Example 21. Suppose two pregnant women are chosen at random, and consider whether each gives birth to a boy or girl (assume there will be no twins, triplets etc.). We can write the sample space as

S = {(boy, boy), (boy, girl), (girl, boy), (girl, girl)},

where, for example, the element (boy, girl) is the outcome that the first woman gives birth to a boy, and the second woman gives birth to a girl. Define Bᵢ to be the event that the ith woman gives birth to a boy.

1. With regard to S, what are the subsets B₁, B₂ and B₁ ∩ B₂?

2. Suppose we assume that each outcome in the sample space is equally likely. What are the values of P(B₁), P(B₂) and P(B₁ ∩ B₂)?

3. Are B₁ and B₂ independent?

Example 22. Two playing cards are drawn at random from a standard 52-card deck. We observe the number of aces and the number of kings in the two cards drawn. We can write the sample space as

S = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)},

where, for example, (1, 0) corresponds to observing 1 ace and no kings. Let A be the event of at least one ace, and let K be the event of at least one king.
1. With regard to S, what are the subsets A, K and A ∩ K?

2. Assume that each card in the deck has the same chance of being selected. Calculate P(A), P(K) and P(A ∩ K).

3. Are A and K independent?

4. Compare P(A) with P(A | K) and comment on the result.

Calculating joint probabilities: a summary

We have now seen various ways to calculate a joint probability P(A ∩ B). These are as follows.

1. Assume independence. If we think that learning A has occurred will not change our probability of B occurring (and vice versa), then we have P(A ∩ B) = P(A)P(B).

Example 23. If I buy a single National Lottery ticket, the probability I don't win any prize is 53/54. If I buy one ticket every week, what is the probability I win nothing in the first two weeks? What is the probability I win nothing in the first month?

2. Direct calculation using classical probability. When using classical probability (assuming the elements of the sample space are equally likely), it may be straightforward to count the number of outcomes in which both A and B occur, and hence calculate P(A ∩ B) directly. In a sense, we are saying that if we write C for the event A ∩ B, it is straightforward to calculate P(C).

Example 24. One playing card is drawn at random from a standard deck of 52. Let A be the event that the card is a red face card (king, queen or jack). Let B be the event that the card is a king.
(a) Are A and B independent?

(b) What is P(A ∩ B)?

3. Using conditional probability. As has already been noted, we have

P(A ∩ B) = P(A)P(B | A) = P(B)P(A | B).

If we already know P(A) and P(B | A), or P(B) and P(A | B), we can calculate P(A ∩ B).

Example 25. (Example 19 revisited.) Suppose 1% of people in a population have a particular disease. A diagnostic test has a 95% chance of detecting the disease in a person carrying the disease. One person is selected at random. What is the probability that they have the disease and the diagnostic test detects it?

Having three methods may seem confusing! In fact, the conditional probability method can be thought of as the way to calculate a joint probability (remembering that conditional probabilities can be specified directly), with the first two methods being short cuts or special cases.

Example 26. Show how the conditional probability method can be used to calculate the joint probabilities in Examples 23 and 24.

7.3 The law of total probability

In some situations, calculating the probability of an event is easiest if we first consider some appropriate conditional probabilities.
Example 27. Suppose the four teams in this year's Champions League semifinals are Manchester United, Barcelona, Milan, and Bayern Munich. Depending on their opponents, you judge Manchester United's probabilities of reaching the final to be

P(reach final | opponent is Barcelona) = 0.2,
P(reach final | opponent is Milan) = 0.4,
P(reach final | opponent is Bayern Munich) = 0.5.

If the semi-final draw has yet to be made (with any two teams having the same probability of being drawn against each other), what is your probability of Manchester United reaching the final?

Theorem 3. (The law of total probability.) Suppose we have a partition E = {E_1, ..., E_n} of a sample space S. Then for any event F,

P(F) = Σ_{i=1}^{n} P(F ∩ E_i),

or, equivalently,

P(F) = Σ_{i=1}^{n} P(F | E_i)P(E_i).

7.4 Bayes' theorem

In Example 19 we considered a diagnostic test, and the probability of the test detecting the disease in someone who has it. But diagnostic tests can sometimes produce 'false positives': a test may claim the presence of the disease in someone who does not have it. In these situations, we will want to know how likely it is that someone has the disease, conditional on their test result.
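Theorem 3 applied to Example 27 takes only a few lines of Python. A sketch, assuming the probabilities judged in the example; since the draw makes any pairing equally likely, each possible opponent has probability 1/3:

```python
# Law of total probability: P(F) = Σ_i P(F | E_i) P(E_i),
# where E_i is the event that the semi-final opponent is team i.
p_reach_given_opponent = {
    "Barcelona": 0.2,
    "Milan": 0.4,
    "Bayern Munich": 0.5,
}

p_opponent = 1 / 3  # each of the three possible opponents is equally likely

p_reach_final = sum(p_f_given_e * p_opponent
                    for p_f_given_e in p_reach_given_opponent.values())
print(p_reach_final)  # (0.2 + 0.4 + 0.5) / 3 ≈ 0.367
```

Note that the answer is just the average of the three conditional probabilities, which is what the law of total probability reduces to when the partition events are equally likely.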
Figure 7: Given a partition E = {E_1, ..., E_n}, we can calculate P(F) by considering in turn how likely each event E_i is, and then how likely F is conditional on E_i.

Example 28. A new diagnostic test has been developed for a particular disease. It is known that 0.1% of people in the population have the disease. The test will detect the disease in 95% of all people who really do have the disease. However, there is also the possibility of a 'false positive': out of all people who do not have the disease, the test will claim they do in 2% of cases. A person is chosen at random to take the test, and the result is positive. How likely is it that that person has the disease?

Theorem 4. (Bayes' theorem.) Suppose we have a partition E = {E_1, ..., E_n} of a sample space S. Then for any event F,

P(E_i | F) = P(E_i)P(F | E_i) / P(F).
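Theorem 4, with E = "has the disease" and F = "tests positive", answers Example 28 directly. A minimal sketch (variable names are our own):

```python
# Bayes' theorem for Example 28.
# Partition: E (has the disease) and its complement (does not).
p_disease = 0.001            # P(E): 0.1% of the population
p_pos_given_disease = 0.95   # P(F | E): test detects a true case
p_pos_given_healthy = 0.02   # P(F | Ē): false positive rate

# Denominator P(F) by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ≈ 0.045
```

So even after a positive result, the person is still very unlikely to have the disease: false positives from the large healthy majority swamp the true positives from the small diseased minority.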
Note that we can calculate P(F) via the law of total probability:

P(F) = Σ_{j=1}^{n} P(F | E_j)P(E_j).

Note that if E is a single event then {E, Ē} is a partition, so Bayes' theorem gives us

P(E | F) = P(F | E)P(E) / [P(F | E)P(E) + P(F | Ē)P(Ē)].

In the context of Bayes' theorem, we sometimes refer to P(E_i) as the prior probability of E_i, and P(E_i | F) as the posterior probability of E_i given F. The prior probability states how likely we thought E_i was before we knew that F had occurred, and the posterior probability states how likely we think E_i is after we have learnt that F has occurred.

7.5 Conditional probabilities: deciding which formula to use

Initially, it can seem confusing that we have two formulae for conditional probabilities: the definition in equation (3) and Bayes' theorem (and we have also said that you can sometimes just write down a conditional probability directly). Firstly, you should note that these aren't really different formulae. To get Bayes' theorem, we have just started with the conditional probability definition

P(A | B) = P(A ∩ B) / P(B),

then rewritten the numerator using P(A ∩ B) = P(A)P(B | A) and rewritten the denominator using the law of total probability. With practice, you will quickly learn to recognise what probabilities you already know, and so how
to calculate P(A | B). However, to start with, you may find it helpful to use the following scheme.

1. Is it obvious? You may have the information you need to write down P(A | B) directly. If not, carry on to Step 2.

2. Write down the conditional probability formula

P(A | B) = P(A ∩ B) / P(B).

3. Consider the numerator P(A ∩ B).

(a) Is it straightforward to write down or calculate P(A ∩ B)? If so, move on to Step 4.

(b) Do you know P(A) and P(B | A)? If so, you can calculate P(A ∩ B) as P(A ∩ B) = P(A)P(B | A). (In this case, you are now using Bayes' theorem.)

4. Consider the denominator P(B). If you know this value already, you're done. If not, try the law of total probability:

P(B) = P(A)P(B | A) + P(Ā)P(B | Ā).

Try working through Examples 17, 20 and 28 again, following the scheme above.
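As a sanity check on the answer to Example 28, we can also estimate the posterior probability by its long-run frequency, simulating many screened individuals and looking only at those who test positive. A sketch; the number of trials and the seed are our own choices:

```python
import random

random.seed(1)  # fixed seed so repeated runs give the same estimate

def simulate(n_trials=1_000_000):
    """Estimate P(disease | positive test) for Example 28 by simulation."""
    positives = diseased_positives = 0
    for _ in range(n_trials):
        has_disease = random.random() < 0.001        # 0.1% prevalence
        if has_disease:
            tests_positive = random.random() < 0.95  # detection rate
        else:
            tests_positive = random.random() < 0.02  # false positive rate
        if tests_positive:
            positives += 1
            diseased_positives += has_disease
    # Long-run frequency: fraction of positive tests that were true cases.
    return diseased_positives / positives

print(simulate())  # close to the exact Bayes answer, roughly 0.045
```

This is the frequency interpretation of the posterior probability: among all the people who test positive, only around 4.5% actually have the disease.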