2 Elements of Probability Theory

Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observation yield different outcomes that cannot be predicted with certainty.

2.1 SAMPLE SPACES

A situation whose outcomes occur randomly is called an experiment. The set of all possible outcomes of an experiment is called the sample space corresponding to the experiment, and is denoted by $\Omega$. A generic element of $\Omega$ is called a sample point, or simply a point, and is denoted by $\omega \in \Omega$.

Example 2.1 A coin is tossed twice and the sequence of heads (H) and tails (T) is recorded. The possible outcomes of this experiment are HH, HT, TH and TT. Hence, the sample space corresponding to this experiment consists of the four points
$$\Omega = \{HH, HT, TH, TT\}.$$

A sample space $\Omega$ is called finite if it is empty or contains a finite number of points; otherwise it is called infinite. A sample space is called countable if its points can be indexed by the set of positive integers. A sample space that is finite or countable is called discrete.

Example 2.2 A coin is tossed until H is recorded. The sample space corresponding to this experiment is
$$\Omega = \{H, TH, TTH, TTTH, TTTTH, \ldots\}.$$
Thus $\Omega$ contains countably many points.

Not all sample spaces are discrete. For example, the sample space consisting of all positive real numbers is not discrete, and neither is the sample space consisting of all real numbers in the interval $[0, 1]$.
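The finite sample space of Example 2.1 can be enumerated directly, and the first points of the countable space of Example 2.2 can be generated; a minimal Python sketch (all variable names are ours, not from the text):

```python
from itertools import product

# Sample space of Example 2.1: all ordered sequences of H and T of length 2.
omega_two_tosses = {"".join(seq) for seq in product("HT", repeat=2)}
print(sorted(omega_two_tosses))  # ['HH', 'HT', 'TH', 'TT']

# The sample space of Example 2.2 is countably infinite, so it cannot be
# enumerated in full; we list its first few points T^k H, k = 0, 1, 2, ...
first_points = ["T" * k + "H" for k in range(5)]
print(first_points)  # ['H', 'TH', 'TTH', 'TTTH', 'TTTTH']
```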
2.2 RELATIONS AMONG EVENTS

A subset of points in a sample space $\Omega$ is called an event in $\Omega$. An event occurs if and only if one of its points occurs. Viewed as an event, $\Omega$ itself is called the sure event. In general, events will be defined by certain conditions on the points that compose them.

Because events are just subsets of points in $\Omega$, concepts and results from point-set theory apply to events. In particular, if $A$ and $B$ are events in $\Omega$, $A$ implies $B$, written $A \subseteq B$, if and only if all points in $A$ also belong to $B$. The events $A$ and $B$ are identical, written $A = B$, if and only if $A \subseteq B$ and $B \subseteq A$, that is, $A$ and $B$ contain exactly the same points. Other usual operations and relations between sets are listed below:

(union) $A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B\}$,
(intersection) $A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}$,
(complement) $A^c = \{\omega \in \Omega : \omega \notin A\}$,
(impossible event) $\emptyset = \Omega^c$,
(difference) $A - B = A \cap B^c$,
(symmetric difference) $A \,\triangle\, B = (A - B) \cup (B - A)$.

Instead of $A \cap B$ we also write $AB$. Operations and relations between sets are easy to visualize using Venn diagrams. If $A \cap B = \emptyset$, then $A$ and $B$ are called disjoint or mutually exclusive events.

Intersections and unions of a countable collection $A_1, A_2, A_3, \ldots$ of events are denoted by $\bigcap_{i=1}^{\infty} A_i$ and $\bigcup_{i=1}^{\infty} A_i$, respectively. If $\mathcal{A}$ is an arbitrary family of subsets of $\Omega$, we write
$$\bigcup_{A \in \mathcal{A}} A = \{\omega \in \Omega : \omega \in A \text{ for some } A \in \mathcal{A}\}$$
and
$$\bigcap_{A \in \mathcal{A}} A = \{\omega \in \Omega : \omega \in A \text{ for all } A \in \mathcal{A}\}.$$

Some basic relationships between events are:
$A \cup \emptyset = A$, $A \cup \Omega = \Omega$, $A \cup A = A$, $\Omega \cup \emptyset = \Omega$,
$A \cap \emptyset = \emptyset$, $A \cap \Omega = A$, $A \cap A = A$, $\Omega \cap \emptyset = \emptyset$,
and
(commutative laws) $A \cup B = B \cup A$, $A \cap B = B \cap A$,
(distributive laws) $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$, $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$,
(associative laws) $(A \cup B) \cup C = A \cup (B \cup C)$, $(A \cap B) \cap C = A \cap (B \cap C)$,
(De Morgan's laws) $(A \cup B)^c = A^c \cap B^c$, $(A \cap B)^c = A^c \cup B^c$.

De Morgan's laws show that complementation, union and intersection are not independent operations. The commutative, distributive and associative laws and De Morgan's laws can easily be extended to a countable collection of events $A_1, A_2, A_3, \ldots$.

The characteristic function (cf) of an event $A \subseteq \Omega$ is the function $1_A(\cdot)$ defined for all $\omega \in \Omega$ by the relation
$$1_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A, \\ 0, & \text{otherwise.} \end{cases}$$
We also write $1(\omega \in A)$ or, more compactly, $1(A)$. There is a one-to-one correspondence between sets and their cf's, and all properties of sets and set operations can be expressed in terms of cf's. For example, if $C = A^c$ then $1_C = 1 - 1_A$, if $C = A \cup B$ then $1_C = \max(1_A, 1_B)$, and if $C = A \cap B$ then $1_C = 1_A 1_B$.

2.3 PROBABILITY

How can we attach probabilities to events? The first and easiest case is an experiment with a finite sample space $\Omega$ consisting of $N$ points. Suppose that, because of the nature of the experiment (e.g. tossing a fair coin), all points in $\Omega$ are equiprobable, that is, equally likely, and let $A$ be some event in $\Omega$. We define the probability of $A$, written $P(A)$, as the ratio
$$P(A) = \frac{N(A)}{N}, \qquad (2.1)$$
where $N(A)$ denotes the number of points in $A$. For any $A \subseteq \Omega$, we have
$$0 \le P(A) \le 1, \qquad P(\Omega) = \frac{N(\Omega)}{N} = 1, \qquad P(\emptyset) = \frac{N(\emptyset)}{N} = 0.$$
Further, if $A$ and $B$ are disjoint events in $\Omega$, then
$$P(A \cup B) = \frac{N(A \cup B)}{N} = \frac{N(A)}{N} + \frac{N(B)}{N} = P(A) + P(B).$$
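The correspondence between set operations and characteristic functions can be checked by brute force on a small sample space; a Python sketch (the sample space and events are our own choices):

```python
# Verify the cf identities 1_{A^c} = 1 - 1_A, 1_{A U B} = max(1_A, 1_B),
# and 1_{A n B} = 1_A * 1_B on a small finite sample space.
omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def cf(S):
    """Characteristic function of the event S, as a dict omega -> {0, 1}."""
    return {w: 1 if w in S else 0 for w in omega}

one_A, one_B = cf(A), cf(B)
assert cf(omega - A) == {w: 1 - one_A[w] for w in omega}         # complement
assert cf(A | B) == {w: max(one_A[w], one_B[w]) for w in omega}  # union
assert cf(A & B) == {w: one_A[w] * one_B[w] for w in omega}      # intersection
```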
Despite its simplicity, formula (2.1) can lead to non-trivial calculations. In order to use it in a given problem, we need to determine: (i) the number $N$ of all equiprobable outcomes, and (ii) the number $N(A)$ of all those outcomes leading to the occurrence of $A$.

A second case is when a basic experiment can be repeated in exactly the same conditions any number $n$ of times. We call this situation the case of independent trials under identical conditions. In this case, we can give a precise meaning to the concept of probability. In each trial a particular event $A$ may or may not occur. Let $n(A)$ be the number of trials in which $A$ occurs. The relative frequency of the event $A$ in the given series of $n$ trials is defined as
$$f_n(A) = \frac{n(A)}{n}.$$
It is an empirical fact that the $f_n(A)$ observed for different series of trials are virtually the same for large $n$, clustering about a constant value $P(A)$, called the probability of $A$. Roughly speaking, the probability of $A$ equals the fraction of trials leading to the occurrence of $A$ in a large series of trials.

2.4 COMBINATORIAL RESULTS

Whenever equal probabilities are assigned to the elements of a finite sample space, computation of probabilities of events reduces to counting the points comprising the events.

Theorem 2.1 Given $n$ elements $a_1, \ldots, a_n$ and $m$ elements $b_1, \ldots, b_m$, there are exactly $nm$ distinct ordered pairs $(a_i, b_j)$ containing one element of each kind.

Thus, if one experiment has $n$ possible outcomes and another experiment has $m$ possible outcomes, there are $nm$ possible outcomes for the two experiments. More generally we have:

Theorem 2.2 Given $n_1$ elements $a_1, \ldots, a_{n_1}$, $n_2$ elements $b_1, \ldots, b_{n_2}$, etc., up to $n_r$ elements $x_1, \ldots, x_{n_r}$, there are $n_1 n_2 \cdots n_r$ distinct ordered $r$-tuples $(a_{i_1}, b_{i_2}, \ldots, x_{i_r})$ containing one element of each kind.
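The clustering of relative frequencies $f_n(A)$ about $P(A)$ can be illustrated by simulation; a sketch (the event, the dice experiment, and the random seed are our own choices):

```python
import random

random.seed(0)  # fixed seed so the runs are reproducible

# Event A: the sum of two fair dice equals 7.  By (2.1), P(A) = 6/36 = 1/6.
def trial():
    return random.randint(1, 6) + random.randint(1, 6) == 7

for n in (100, 10_000, 1_000_000):
    f_n = sum(trial() for _ in range(n)) / n  # relative frequency f_n(A)
    print(n, f_n)

# As n grows, the printed frequencies cluster about 1/6, roughly 0.1667.
```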
Thus, if there are $r$ experiments, where the first has $n_1$ possible outcomes, the second $n_2$, ..., and the $r$th $n_r$ possible outcomes, there are a total of $n_1 n_2 \cdots n_r$ possible outcomes for the $r$ experiments.

A permutation is an ordered arrangement of objects. An ordered sample of size $r$ is a permutation of $r$ objects obtained from a set of $n$ elements. Two possible ways of obtaining samples are: sampling with replacement and sampling without replacement. Notice that only samples of size $r \le n$ without replacement are possible.
Theorem 2.3 Given a set of $n$ elements and sample size $r$, there are $n^r$ different ordered samples with replacement, and
$$n(n-1)(n-2) \cdots (n-r+1) = \frac{n!}{(n-r)!}$$
different ordered samples without replacement.

Theorem 2.3 implies that the number of permutations or orderings of $n$ elements is equal to $n!$.

A combination is a set of elements without repetitions and without regard to ordering. For example, $\{a, b\}$ and $\{b, a\}$ are different permutations but only one combination. Thus, the number of combinations is the number of unordered samples of a given size drawn without replacement from a finite set of objects.

Theorem 2.4 The number of possible combinations of $n$ objects taken $r$ at a time ($r \le n$) is equal to
$$C_r^n = \frac{n!}{r!(n-r)!} = \binom{n}{r}.$$

Proof. Since the number of ordered samples is equal to the number of unordered samples times the number of ways to order each sample, we have that
$$\frac{n!}{(n-r)!} = C_r^n \, r!,$$
from which
$$C_r^n = \frac{n!}{r!(n-r)!}.$$

The number $C_r^n$ is called a binomial coefficient, since it occurs in the binomial expansion
$$(a+b)^n = \sum_{r=0}^{n} \binom{n}{r} a^{n-r} b^r.$$
More generally, simple induction gives the following:

Theorem 2.5 Given a set of $n$ elements, let $n_1, \ldots, n_k$ be positive integers such that $\sum_{i=1}^{k} n_i = n$. Then there are
$$\binom{n}{n_1 \, n_2 \, \ldots \, n_k} = \frac{n!}{n_1! \, n_2! \cdots n_k!} \qquad (2.2)$$
ways of partitioning the set into $k$ unordered samples without replacement of sizes $n_1, \ldots, n_k$ respectively. The numbers (2.2) are called multinomial coefficients.
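The counts of Theorems 2.3–2.5 are available directly in Python's standard library; a sketch checking them against the factorial formulas (the specific values of $n$, $r$ and the partition sizes are our own):

```python
from math import comb, factorial, perm

n, r = 5, 3
assert n ** r == 125                                          # with replacement
assert perm(n, r) == factorial(n) // factorial(n - r) == 60   # without replacement
assert comb(n, r) == factorial(n) // (factorial(r) * factorial(n - r)) == 10

# Binomial expansion check: (a + b)^n = sum_r C(n, r) a^(n-r) b^r.
a, b = 2, 3
assert sum(comb(n, r) * a ** (n - r) * b ** r for r in range(n + 1)) == (a + b) ** n

# Multinomial coefficient (2.2) for a partition of n = 6 into sizes 3, 2, 1.
sizes = (3, 2, 1)
multinomial = factorial(sum(sizes))
for s in sizes:
    multinomial //= factorial(s)  # 6! / (3! 2! 1!)
```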
2.5 FINITE PROBABILITY SPACES

The definition of probability in terms of equiprobable events is circular. On the other hand, defining probabilities as limits of relative frequencies in independent trials under identical conditions is far too restrictive. To avoid these problems we shall now present a purely axiomatic treatment of probabilities.

Definition 2.1 A sample space $\Omega$ is called a finite probability space if $\Omega$ is finite and, for every event $A \subseteq \Omega$, there is defined a real number $P(A)$, called the probability of the event $A$, such that:

A.1: $P(A) \ge 0$;
A.2: $P(\Omega) = 1$;
A.3: if $A$ and $B$ are mutually exclusive events in $\Omega$, then $P(A \cup B) = P(A) + P(B)$.

It follows from Definition 2.1 that, for any subsets $A$ and $B$ of $\Omega$,
$$0 \le P(A) \le 1. \qquad (2.3)$$
Further,
$$P(A^c) = 1 - P(A), \qquad (2.4)$$
$$P(\emptyset) = 0, \qquad (2.5)$$
$$A \subseteq B \;\Rightarrow\; P(A) \le P(B), \qquad (2.6)$$
$$P(A \cup B) = P(A) + P(B) - P(AB) \quad \text{(Covering theorem)}. \qquad (2.7)$$
This implies the following upper bound on $P(A \cup B)$:
$$P(A \cup B) \le P(A) + P(B),$$
with equality if and only if $A$ and $B$ are disjoint.

Notice that $B = AB \cup A^c B$, where $AB$ and $A^c B$ are mutually exclusive events. Hence $P(B) = P(AB) + P(A^c B)$ and therefore $P(B) - P(AB) = P(A^c B)$. Substituting in (2.7) gives
$$P(A \cup B) = P(A) + P(A^c B) \quad \text{(Addition law)}. \qquad (2.8)$$

Also notice that, by De Morgan's laws and the Covering theorem,
$$1 - P(AB) = P((AB)^c) = P(A^c \cup B^c) \le P(A^c) + P(B^c).$$
This implies
$$P(AB) \ge 1 - P(A^c) - P(B^c) \quad \text{(Bonferroni inequality)}, \qquad (2.9)$$
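On a finite equiprobable space, relationships (2.4)–(2.9) can be verified numerically with exact rational arithmetic; a sketch (the space and the events are our own choices; note that in the code `|` and `&` denote set union and intersection):

```python
from fractions import Fraction

omega = set(range(12))                       # 12 equiprobable points
P = lambda E: Fraction(len(E), len(omega))   # classical probability (2.1)

A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
Ac, Bc = omega - A, omega - B

assert P(Ac) == 1 - P(A)                     # (2.4)
assert P(A | B) == P(A) + P(B) - P(A & B)    # Covering theorem (2.7)
assert P(A | B) == P(A) + P(Ac & B)          # Addition law (2.8)
assert P(A & B) >= 1 - P(Ac) - P(Bc)         # Bonferroni inequality (2.9)
```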
with equality if and only if $A^c$ and $B^c$ are disjoint.

More generally, if $A_1, \ldots, A_n$ is a finite collection of events in $\Omega$, then $A_1, A_1^c A_2, A_1^c A_2^c A_3, \ldots, A_1^c A_2^c \cdots A_{n-1}^c A_n$ form a partition of $\bigcup_{i=1}^{n} A_i$, and so
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = P(A_1) + P(A_1^c A_2) + \cdots + P(A_1^c A_2^c \cdots A_{n-1}^c A_n).$$
This result generalizes the Addition law (2.8). Since $A_1^c A_2^c \cdots A_{n-1}^c A_n \subseteq A_n$ for all $n \ge 1$, we also have
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{i=1}^{n} P(A_i).$$
This result generalizes the Covering theorem (2.7). Finally, the generalization of the Bonferroni inequality (2.9) is
$$P\Big(\bigcap_{i=1}^{n} A_i\Big) \ge 1 - \sum_{i=1}^{n} P(A_i^c).$$

2.6 MEASURABLE SPACES AND MEASURES

For infinite sample spaces, some modifications of the axioms A.1–A.3 and some additional concepts of set theory are required. The reason is that some subsets of an infinite sample space may be so irregular that it is not possible to assign a probability to them.

A set whose elements are sets of $\Omega$ will be called a class of sets in $\Omega$. When a set operation performed on sets in a class $\mathcal{A}$ gives as a result sets which also belong to $\mathcal{A}$, we say that $\mathcal{A}$ is closed under the given operation.

Definition 2.2 A nonempty class $\mathcal{A}$ of subsets of $\Omega$ is called a field or an algebra on $\Omega$ if it contains $\Omega$ and is closed under complementation and finite unions, that is,
$$\Omega \in \mathcal{A}; \qquad A \in \mathcal{A} \;\Rightarrow\; A^c \in \mathcal{A}; \qquad (2.10)$$
$$A_i \in \mathcal{A},\; i = 1, \ldots, n \;\Rightarrow\; \bigcup_{i=1}^{n} A_i \in \mathcal{A}. \qquad (2.11)$$
By De Morgan's laws, (2.10) and (2.11) together imply
$$A_i \in \mathcal{A},\; i = 1, \ldots, n \;\Rightarrow\; \bigcap_{i=1}^{n} A_i \in \mathcal{A}.$$
Thus, all standard set operations (union, intersection and complementation) can be performed any finite number of times on the elements of a field $\mathcal{A}$ without obtaining a set not in $\mathcal{A}$.

Definition 2.3 A field $\mathcal{A}$ on $\Omega$ is a $\sigma$-field or a $\sigma$-algebra if it is closed under countable unions, that is, if
$$A_i \in \mathcal{A},\; i = 1, 2, \ldots \;\Rightarrow\; \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}. \qquad (2.12)$$
By De Morgan's laws, (2.10) and (2.12) together imply
$$A_i \in \mathcal{A},\; i = 1, 2, \ldots \;\Rightarrow\; \bigcap_{i=1}^{\infty} A_i \in \mathcal{A}.$$
Thus, all standard set operations can be performed any countable number of times on the elements of a $\sigma$-field $\mathcal{A}$ without obtaining a set not in $\mathcal{A}$.

If $\mathcal{A}$ is a class of subsets of $\Omega$, the smallest field ($\sigma$-field) containing $\mathcal{A}$ is called the field ($\sigma$-field) generated by $\mathcal{A}$. It can be verified that the field ($\sigma$-field) generated by $\mathcal{A}$ is equal to the intersection of all fields ($\sigma$-fields) containing $\mathcal{A}$.

If $\mathcal{A}$ is a $\sigma$-field on $\Omega$, the pair $(\Omega, \mathcal{A})$ is called a measurable space. A subset $A$ of $\Omega$ is said to be measurable if $A \in \mathcal{A}$. Given a space $\Omega$, it is generally possible to define many $\sigma$-fields on $\Omega$. To distinguish between them, the members of a given $\sigma$-field $\mathcal{A}$ on $\Omega$ will be called $\mathcal{A}$-measurable sets.

Example 2.3 An important $\sigma$-field on the real line $\Re$ is that generated by the class of all bounded semi-closed intervals of the form $(a, b]$, $-\infty < a < b < \infty$. This $\sigma$-field is called the Borel field on $\Re$ and denoted by $\mathcal{B}$. Its elements are called the Borel sets. Since $\mathcal{B}$ is a $\sigma$-field, repeated finite and countable set-theoretic operations on its elements will never lead outside $\mathcal{B}$. The measurable space $(\Re, \mathcal{B})$ is called the Borel line. Notice that $\mathcal{B}$ would equivalently be generated by all the open half-lines of $\Re$, all the open intervals of $\Re$, or all the closed intervals of $\Re$.

A set function is a function defined on a class of sets.

Definition 2.4 A measure $\mu$ on a measurable space $(\Omega, \mathcal{A})$ is a nonnegative set function $\mu$ defined for all sets of $\mathcal{A}$ and satisfying:

M.1: $\mu(\emptyset) = 0$;
M.2: (Countable additivity) if $\{A_i\}$ is any countable sequence of disjoint $\mathcal{A}$-measurable sets, then
$$\mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \mu(A_i).$$
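For a finite $\Omega$, the field generated by a class of subsets can be computed by brute force, repeatedly closing under complements and pairwise unions until nothing new appears (on a finite space every field is automatically a $\sigma$-field); a sketch with function names of our own:

```python
from itertools import combinations

def generated_field(omega, seed_sets):
    """Smallest field on the finite set omega containing every set in seed_sets."""
    field = {frozenset(omega)} | {frozenset(s) for s in seed_sets}
    while True:
        new = {frozenset(omega - s) for s in field}           # close under complement
        new |= {a | b for a, b in combinations(field, 2)}     # close under union
        if new <= field:                                      # nothing new: done
            return field
        field |= new

omega = frozenset({1, 2, 3, 4})
F = generated_field(omega, [{1}])
# The field generated by {1} is {empty set, {1}, {2,3,4}, omega}.
assert F == {frozenset(), frozenset({1}), frozenset({2, 3, 4}), omega}
```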
Clearly, countable additivity implies finite additivity, that is, if $A_1, \ldots, A_n$ is a finite collection of disjoint measurable sets, then
$$\mu\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} \mu(A_i).$$

Example 2.4 Let $f$ be a nonnegative function of the points of a set $\Omega$. Let the $\sigma$-field $\mathcal{A}$ consist of all countable subsets of $\Omega$. A measure $\mu$ on $(\Omega, \mathcal{A})$ is then defined as
$$\mu(\emptyset) = 0, \qquad \mu(\{\omega_1, \ldots, \omega_n\}) = \sum_{i=1}^{n} f(\omega_i).$$
If $f \equiv 1$, then $\mu$ is called counting measure.

It is easy to verify that if $\mu$ is a measure on $(\Omega, \mathcal{A})$, then it is monotone, that is, $\mu(A) \le \mu(B)$ whenever $A, B \in \mathcal{A}$ and $A \subset B$.

Definition 2.5 A measure $\mu$ on $(\Omega, \mathcal{A})$ is called finite if $\mu(\Omega) < \infty$. It is called $\sigma$-finite if there exists a sequence $\{A_i\}$ of sets in $\mathcal{A}$ such that $\bigcup_{i=1}^{\infty} A_i = \Omega$ and $\mu(A_i) < \infty$, $i = 1, 2, \ldots$.

Example 2.5 An important $\sigma$-finite measure is the one defined on the Borel line $(\Re, \mathcal{B})$ by $\mu((a, b]) = b - a$, the length of the interval $(a, b]$. Such a measure is called Lebesgue measure. It is easy to verify that every countable set is a Borel set of measure zero.

Definition 2.6 If $\mu$ is a measure on $(\Omega, \mathcal{A})$, the triple $(\Omega, \mathcal{A}, \mu)$ is called a measure space.

A measure space $(\Omega, \mathcal{A}, \mu)$ is called complete if it contains all subsets of sets of measure zero, that is, if $A \in \mathcal{A}$, $B \subset A$, and $\mu(A) = 0$, then $B \in \mathcal{A}$. It can be shown that each measure space can be completed by the addition of subsets of sets of measure zero.

If $\mu$ is a $\sigma$-finite measure defined on a field $\mathcal{A}$ and $\mathcal{F}(\mathcal{A})$ is the $\sigma$-field generated by $\mathcal{A}$, then it can be shown that there exists a unique measure $\mu^*$ on $(\Omega, \mathcal{F}(\mathcal{A}))$ such that $\mu^*(A) = \mu(A)$ for all $A \in \mathcal{A}$. Further, $\mu^*$ is also $\sigma$-finite. Such a measure is called the extension of $\mu$.

Definition 2.7 A measure space $(\Omega, \mathcal{A}, P)$ is a probability space if $P$ is a $\sigma$-finite measure with $P(\Omega) = 1$.

2.7 PROBABILITY SPACES

From Definition 2.7, a probability space is a triple $(\Omega, \mathcal{A}, P)$, where $\Omega$ is the sample space associated with an experiment, $\mathcal{A}$ is a $\sigma$-field on $\Omega$, and the probability measure $P$ is a real-valued function defined for all sets in $\mathcal{A}$ and satisfying:
P.1: $P(A) \ge 0$ for all $A \in \mathcal{A}$;
P.2: $P(\Omega) = 1$;
P.3: (Countable additivity) if $\{A_i\}$ is a countable sequence of disjoint sets in $\mathcal{A}$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$

If $(\Omega, \mathcal{A}, P)$ is a probability space, then the sets in $\mathcal{A}$ are interpreted as the possible events associated with an experiment. For any $A \in \mathcal{A}$, the real number $P(A)$ is called the probability of the event $A$. A support of $P$ is any set $A \in \mathcal{A}$ for which $P(A) = 1$.

If $\Omega$ is a finite sample space and $\mathcal{A}$ is the set of all the events in $\Omega$ (the collection of all subsets of $\Omega$), then properties P.1–P.3 are equivalent to A.1–A.3 that define a finite probability space. As a consequence of properties P.1–P.3, relationships (2.3)–(2.9) hold for any $A, B \in \mathcal{A}$. Further, their generalizations hold for any finite collection of events in $\mathcal{A}$. Notice that the Covering theorem (2.7) can be shown to hold for any countable collection of events in $\mathcal{A}$. Further, if $\{A_i\}$ is a countable collection of events in $\mathcal{A}$ such that $A_1 \subseteq A_2 \subseteq \cdots$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \lim_{n \to \infty} P(A_n).$$

2.8 CONDITIONAL PROBABILITY

Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $B \in \mathcal{A}$ be an event such that $P(B) > 0$. If we know that $B$ occurred, then the relevant sample space becomes $B$ rather than $\Omega$. This justifies defining the conditional probability of $A$ given $B$ as
$$P(A \mid B) = \frac{P(AB)}{P(B)} \qquad (2.13)$$
if $P(B) > 0$, and $P(A \mid B) = 0$ if $P(B) = 0$.

It is easy to verify that the function $P(\cdot \mid B)$ defined on $\mathcal{A}$ is a probability measure on $(\Omega, \mathcal{A})$, that is, it satisfies P.1–P.3. We call $P(\cdot \mid B)$ the conditional probability measure given $B$. Notice that (2.13) can equivalently be written as
$$P(AB) = P(A \mid B) P(B).$$
This result, called the Multiplication law, provides a convenient way of finding $P(AB)$ whenever $P(A \mid B)$ and $P(B)$ are easy to find. The Multiplication law can be generalized to a finite collection of events $A_1, \ldots, A_n$ in $\mathcal{A}$:
$$P(A_1 \cdots A_n) = P(A_n \mid A_{n-1} \cdots A_1) P(A_{n-1} \cdots A_1) = P(A_n \mid A_{n-1} \cdots A_1) P(A_{n-1} \mid A_{n-2} \cdots A_1) P(A_{n-2} \cdots A_1),$$
and so on. Thus
$$P(A_1 \cdots A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_2 A_1) \cdots P(A_n \mid A_{n-1} \cdots A_1).$$

Now consider a countable collection $\{B_i\}$ of disjoint events in $\mathcal{A}$ such that $P(B_i) > 0$ for every $i$ and $\bigcup_{i=1}^{\infty} B_i = \Omega$. Clearly,
$$P\Big(\bigcup_{i=1}^{\infty} B_i\Big) = \sum_{i=1}^{\infty} P(B_i) = 1.$$
For any $A \in \mathcal{A}$,
$$P(A) = P\Big(A \cap \Big(\bigcup_{i=1}^{\infty} B_i\Big)\Big) + P\Big(A \cap \Big(\bigcup_{i=1}^{\infty} B_i\Big)^c\Big) = P\Big(A \cap \Big(\bigcup_{i=1}^{\infty} B_i\Big)\Big),$$
where we used the fact that
$$P\Big(\Big(\bigcup_{i=1}^{\infty} B_i\Big)^c\Big) = 1 - P\Big(\bigcup_{i=1}^{\infty} B_i\Big) = 0.$$
Thus, by the distributive law,
$$P(A) = P\Big(\bigcup_{i=1}^{\infty} A B_i\Big) = \sum_{i=1}^{\infty} P(A B_i),$$
since $A B_i \cap A B_j = \emptyset$ for all $i \ne j$. Therefore
$$P(A) = \sum_{i=1}^{\infty} P(A \mid B_i) P(B_i),$$
which is called the Law of total probabilities.

Now let $A \in \mathcal{A}$ be such that $P(A) > 0$, and consider computing the conditional probability $P(B_j \mid A)$ given knowledge of $\{P(A \mid B_i)\}$ and $\{P(B_i)\}$. By the definition of conditional probability and the Multiplication law,
$$P(B_j \mid A) = \frac{P(B_j A)}{P(A)} = \frac{P(A \mid B_j) P(B_j)}{P(A)}$$
for any fixed $j = 1, 2, \ldots$. Therefore, by the Law of total probabilities,
$$P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{\infty} P(A \mid B_i) P(B_i)},$$
which is called Bayes' rule.
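The Law of total probabilities and Bayes' rule can be illustrated with a standard two-urn setup (the urn composition and all numbers are our own invention, not from the text):

```python
from fractions import Fraction

# Partition {B1, B2}: an urn is chosen at random, P(B1) = P(B2) = 1/2.
# Urn 1 holds 3 red and 1 white ball; urn 2 holds 1 red and 3 white.
# A = "a ball drawn from the chosen urn is red".
P_B = {1: Fraction(1, 2), 2: Fraction(1, 2)}
P_A_given_B = {1: Fraction(3, 4), 2: Fraction(1, 4)}

# Law of total probabilities: P(A) = sum_i P(A|Bi) P(Bi).
P_A = sum(P_A_given_B[i] * P_B[i] for i in P_B)
assert P_A == Fraction(1, 2)

# Bayes' rule: P(B1|A) = P(A|B1) P(B1) / P(A).
P_B1_given_A = P_A_given_B[1] * P_B[1] / P_A
assert P_B1_given_A == Fraction(3, 4)
```

Drawing a red ball raises the probability that urn 1 was chosen from 1/2 to 3/4, exactly as Bayes' rule prescribes.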
2.9 INDEPENDENCE

Let $A, B \in \mathcal{A}$ be two events with non-zero probability. If knowing that $B$ occurred gives no information about whether or not $A$ occurred, then the probability assigned to $A$ should not be modified by the knowledge that $B$ occurred. Hence
$$P(A \mid B) = P(A),$$
and so
$$P(AB) = P(A) P(B). \qquad (2.14)$$
Two events $A, B \in \mathcal{A}$ are said to be (pairwise) independent if (2.14) holds. Notice that this definition of independence is symmetric in $A$ and $B$, and also covers the case when $P(A) = 0$ or $P(B) = 0$. It is easy to show that if $A$ and $B$ are independent, then $A$ and $B^c$, as well as $A^c$ and $B^c$, are independent.

Three events $A, B, C \in \mathcal{A}$ are said to be (mutually) independent if they are pairwise independent and
$$P(ABC) = P(A) P(B) P(C).$$
This condition is necessary, for pairwise independence does not ensure that, for example, $P((AB)C) = P(AB) P(C)$. It is easy to verify that if $A$, $B$ and $C$ are independent events, then $A \cup B$ and $C$ are independent, and $A \cap B$ and $C$ are independent. More generally, the events of a family $\mathcal{A}$ are (mutually) independent if, for every finite collection $A_1, \ldots, A_n$ of events in $\mathcal{A}$,
$$P\Big(\bigcap_{i=1}^{n} A_i\Big) = \prod_{i=1}^{n} P(A_i). \qquad (2.15)$$
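The classical two-coin construction shows concretely that pairwise independence does not imply mutual independence; a sketch checking (2.14) and the triple condition by exact counting (the events are our own choices):

```python
from fractions import Fraction
from itertools import product

omega = set(product("HT", repeat=2))         # two fair coin tosses, 4 points
P = lambda E: Fraction(len(E), len(omega))   # equiprobable points, as in (2.1)

A = {w for w in omega if w[0] == "H"}                       # first toss is H
B = {w for w in omega if w[1] == "H"}                       # second toss is H
C = {w for w in omega if (w[0] == "H") == (w[1] == "H")}    # the two tosses agree

# A, B, C are pairwise independent ...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)
# ... but not mutually independent: P(ABC) = 1/4, while P(A)P(B)P(C) = 1/8.
assert P(A & B & C) == Fraction(1, 4)
assert P(A & B & C) != P(A) * P(B) * P(C)
```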