M2S1 Lecture Notes
G. A. Young
http://www2.imperial.ac.uk/~ayoung
September 2011


Contents

1 DEFINITIONS, TERMINOLOGY, NOTATION
  1.1 EVENTS AND THE SAMPLE SPACE
    1.1.1 OPERATIONS IN SET THEORY
    1.1.2 MUTUALLY EXCLUSIVE EVENTS AND PARTITIONS
  1.2 THE σ-FIELD
  1.3 THE PROBABILITY FUNCTION
  1.4 PROPERTIES OF P(.): THE AXIOMS OF PROBABILITY
  1.5 CONDITIONAL PROBABILITY
  1.6 THE THEOREM OF TOTAL PROBABILITY
  1.7 BAYES THEOREM
  1.8 COUNTING TECHNIQUES
    1.8.1 THE MULTIPLICATION PRINCIPLE
    1.8.2 SAMPLING FROM A FINITE POPULATION
    1.8.3 PERMUTATIONS AND COMBINATIONS
    1.8.4 PROBABILITY CALCULATIONS

2 RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS
  2.1 RANDOM VARIABLES & PROBABILITY MODELS
  2.2 DISCRETE RANDOM VARIABLES
    2.2.1 PROPERTIES OF MASS FUNCTION f_X
    2.2.2 CONNECTION BETWEEN F_X AND f_X
    2.2.3 PROPERTIES OF DISCRETE CDF F_X
  2.3 CONTINUOUS RANDOM VARIABLES
    2.3.1 PROPERTIES OF CONTINUOUS F_X AND f_X
  2.4 EXPECTATIONS AND THEIR PROPERTIES
  2.5 INDICATOR VARIABLES
  2.6 TRANSFORMATIONS OF RANDOM VARIABLES
    2.6.1 GENERAL TRANSFORMATIONS
    2.6.2 1-1 TRANSFORMATIONS
  2.7 GENERATING FUNCTIONS
    2.7.1 MOMENT GENERATING FUNCTIONS
    2.7.2 KEY PROPERTIES OF MGFS
    2.7.3 OTHER GENERATING FUNCTIONS
  2.8 JOINT PROBABILITY DISTRIBUTIONS
    2.8.1 THE CHAIN RULE FOR RANDOM VARIABLES
    2.8.2 CONDITIONAL EXPECTATION AND ITERATED EXPECTATION
  2.9 MULTIVARIATE TRANSFORMATIONS
  2.10 MULTIVARIATE EXPECTATIONS AND COVARIANCE
    2.10.1 EXPECTATION WITH RESPECT TO JOINT DISTRIBUTIONS
    2.10.2 COVARIANCE AND CORRELATION
    2.10.3 JOINT MOMENT GENERATING FUNCTION
    2.10.4 FURTHER RESULTS ON INDEPENDENCE
  2.11 ORDER STATISTICS

3 DISCRETE PROBABILITY DISTRIBUTIONS

4 CONTINUOUS PROBABILITY DISTRIBUTIONS

5 MULTIVARIATE PROBABILITY DISTRIBUTIONS
  5.1 THE MULTINOMIAL DISTRIBUTION
  5.2 THE DIRICHLET DISTRIBUTION
  5.3 THE MULTIVARIATE NORMAL DISTRIBUTION

6 PROBABILITY RESULTS & LIMIT THEOREMS
  6.1 BOUNDS ON PROBABILITIES BASED ON MOMENTS
  6.2 THE CENTRAL LIMIT THEOREM
  6.3 MODES OF STOCHASTIC CONVERGENCE
    6.3.1 CONVERGENCE IN DISTRIBUTION
    6.3.2 CONVERGENCE IN PROBABILITY
    6.3.3 CONVERGENCE IN QUADRATIC MEAN

7 STATISTICAL ANALYSIS
  7.1 STATISTICAL SUMMARIES
  7.2 SAMPLING DISTRIBUTIONS
  7.3 HYPOTHESIS TESTING
    7.3.1 TESTING FOR NORMAL SAMPLES - THE Z-TEST
    7.3.2 HYPOTHESIS TESTING TERMINOLOGY
    7.3.3 THE t-TEST
    7.3.4 TEST FOR σ
    7.3.5 TWO SAMPLE TESTS
  7.4 POINT ESTIMATION
    7.4.1 ESTIMATION TECHNIQUES I: METHOD OF MOMENTS
    7.4.2 ESTIMATION TECHNIQUES II: MAXIMUM LIKELIHOOD
  7.5 INTERVAL ESTIMATION
    7.5.1 PIVOTAL QUANTITY
    7.5.2 INVERTING A TEST STATISTIC

CHAPTER 1

DEFINITIONS, TERMINOLOGY, NOTATION

1.1 EVENTS AND THE SAMPLE SPACE

Definition 1.1.1 An experiment is a one-off or repeatable process or procedure for which (a) there is a well-defined set of possible outcomes, and (b) the actual outcome is not known with certainty.

Definition 1.1.2 A sample outcome, $\omega$, is precisely one of the possible outcomes of an experiment.

Definition 1.1.3 The sample space, $\Omega$, of an experiment is the set of all possible outcomes.

NOTE: $\Omega$ is a set in the mathematical sense, so set theory notation can be used. For example, if the sample outcomes are denoted $\omega_1, \ldots, \omega_k$, say, then
$$\Omega = \{\omega_1, \ldots, \omega_k\} = \{\omega_i : i = 1, \ldots, k\},$$
and $\omega_i \in \Omega$ for $i = 1, \ldots, k$. The sample space of an experiment can be

- a FINITE list of sample outcomes, $\{\omega_1, \ldots, \omega_k\}$
- a (countably) INFINITE list of sample outcomes, $\{\omega_1, \omega_2, \ldots\}$
- an INTERVAL or REGION of a real space, $\{\omega : \omega \in A \subseteq \mathbb{R}^d\}$

Definition 1.1.4 An event, $E$, is a designated collection of sample outcomes. Event $E$ occurs if the actual outcome of the experiment is one of this collection. An event is, therefore, a subset of the sample space $\Omega$.

Special Cases of Events

The event corresponding to the collection of all sample outcomes is $\Omega$. The event corresponding to a collection of none of the sample outcomes is denoted $\emptyset$. That is, the sets $\emptyset$ and $\Omega$ are also events, termed the impossible and the certain event respectively, and for any event $E$, $E \subseteq \Omega$.

1.1.1 OPERATIONS IN SET THEORY

Since events are subsets of $\Omega$, set theory operations are used to manipulate events in probability theory. Consider events $E, F \subseteq \Omega$. Then we can reasonably concern ourselves also with events obtained from the three basic set operations:

UNION $E \cup F$ : $E$ or $F$ or both occur
INTERSECTION $E \cap F$ : both $E$ and $F$ occur
COMPLEMENT $E'$ : $E$ does not occur

Properties of Union/Intersection operators

Consider events $E, F, G \subseteq \Omega$.

COMMUTATIVITY:
$$E \cup F = F \cup E, \qquad E \cap F = F \cap E$$

ASSOCIATIVITY:
$$E \cup (F \cup G) = (E \cup F) \cup G, \qquad E \cap (F \cap G) = (E \cap F) \cap G$$

DISTRIBUTIVITY:
$$E \cup (F \cap G) = (E \cup F) \cap (E \cup G), \qquad E \cap (F \cup G) = (E \cap F) \cup (E \cap G)$$

DE MORGAN'S LAWS:
$$(E \cup F)' = E' \cap F', \qquad (E \cap F)' = E' \cup F'$$

1.1.2 MUTUALLY EXCLUSIVE EVENTS AND PARTITIONS

Definition 1.1.5 Events $E$ and $F$ are mutually exclusive if $E \cap F = \emptyset$, that is, if events $E$ and $F$ cannot both occur. If the sets of sample outcomes represented by $E$ and $F$ are disjoint (have no common element), then $E$ and $F$ are mutually exclusive.

Definition 1.1.6 Events $E_1, \ldots, E_k \subseteq \Omega$ form a partition of event $F \subseteq \Omega$ if
(a) $E_i \cap E_j = \emptyset$ for $i \neq j$, $i, j = 1, \ldots, k$;
(b) $\bigcup_{i=1}^{k} E_i = F$,
so that each element of the collection of sample outcomes corresponding to event $F$ is in one and only one of the collections corresponding to events $E_1, \ldots, E_k$.
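The set identities above are easy to sanity-check by direct computation on a small finite sample space. The following is a minimal sketch in plain Python (no external libraries); the sample space and the events are arbitrary illustrations, and `complement` is a helper defined here for the example.

```python
import random

# Small finite sample space and three randomly chosen events E, F, G.
omega = set(range(10))
random.seed(1)
E, F, G = (set(random.sample(sorted(omega), 5)) for _ in range(3))

def complement(A):
    # Complement relative to the sample space omega.
    return omega - A

# Distributivity of union over intersection, and vice versa.
assert E | (F & G) == (E | F) & (E | G)
assert E & (F | G) == (E & F) | (E & G)

# De Morgan's laws.
assert complement(E | F) == complement(E) & complement(F)
assert complement(E & F) == complement(E) | complement(F)

print("All set identities verified on a random example.")
```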

[Figure 1.1: Partition of $\Omega$]

In Figure 1.1, we have $\Omega = \bigcup_{i=1}^{6} E_i$.

[Figure 1.2: Partition of $F \subseteq \Omega$]

In Figure 1.2, we have $F = \bigcup_{i=1}^{6} (F \cap E_i)$, but, for example, $F \cap E_6 = \emptyset$.

1.2 THE σ-FIELD

Events are subsets of $\Omega$, but need all subsets of $\Omega$ be events? The answer is negative. But it suffices to think of the collection of events as a subcollection $\mathcal{A}$ of the set of all subsets of $\Omega$. This subcollection should have the following properties:
(a) if $A, B \in \mathcal{A}$ then $A \cup B \in \mathcal{A}$ and $A \cap B \in \mathcal{A}$;
(b) if $A \in \mathcal{A}$ then $A' \in \mathcal{A}$;

(c) $\emptyset \in \mathcal{A}$.

A collection $\mathcal{A}$ of subsets of $\Omega$ which satisfies these three conditions is called a field. It follows from the properties of a field that if $A_1, A_2, \ldots, A_k \in \mathcal{A}$, then
$$\bigcup_{i=1}^{k} A_i \in \mathcal{A}.$$
So, $\mathcal{A}$ is closed under finite unions, and hence under finite intersections also. To see this, note that if $A_1, A_2 \in \mathcal{A}$, then
$$A_1, A_2 \in \mathcal{A} \implies A_1', A_2' \in \mathcal{A} \implies A_1' \cup A_2' \in \mathcal{A} \implies (A_1' \cup A_2')' \in \mathcal{A} \implies A_1 \cap A_2 \in \mathcal{A}.$$

This is fine when $\Omega$ is a finite set, but we require slightly more to deal with the common situation when $\Omega$ is infinite. We require the collection of events to be closed under the operation of taking countable unions, not just finite unions.

Definition 1.2.1 A collection $\mathcal{A}$ of subsets of $\Omega$ is called a σ-field if it satisfies the following conditions:
(I) $\emptyset \in \mathcal{A}$;
(II) if $A_1, A_2, \ldots \in \mathcal{A}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$;
(III) if $A \in \mathcal{A}$ then $A' \in \mathcal{A}$.

To recap, with any experiment we may associate a pair $(\Omega, \mathcal{A})$, where $\Omega$ is the set of all possible outcomes (or elementary events) and $\mathcal{A}$ is a σ-field of subsets of $\Omega$, which contains all the events in whose occurrences we may be interested. So, from now on, to call a set $A$ an event is equivalent to asserting that $A$ belongs to the σ-field in question.

1.3 THE PROBABILITY FUNCTION

Definition 1.3.1 For an event $E \subseteq \Omega$, the probability that $E$ occurs will be written $P(E)$.

Interpretation: $P(.)$ is a set-function that assigns weight to collections of possible outcomes of an experiment. There are many ways to think about precisely how this assignment is achieved:

CLASSICAL: consider equally likely sample outcomes...
FREQUENTIST: consider long-run relative frequencies...
SUBJECTIVE: consider personal degree of belief...

or merely think of $P(.)$ as a set-function. Formally, we have the following definition.

Definition 1.3.2 A probability function $P(.)$ on $(\Omega, \mathcal{A})$ is a function $P : \mathcal{A} \to [0, 1]$ satisfying:

(a) $P(\emptyset) = 0$, $P(\Omega) = 1$;
(b) if $A_1, A_2, \ldots$ is a collection of disjoint members of $\mathcal{A}$, so that $A_i \cap A_j = \emptyset$ for all pairs $i, j$ with $i \neq j$, then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

The triple $(\Omega, \mathcal{A}, P(.))$, consisting of a set $\Omega$, a σ-field $\mathcal{A}$ of subsets of $\Omega$ and a probability function $P(.)$ on $(\Omega, \mathcal{A})$, is called a probability space.

1.4 PROPERTIES OF P(.): THE AXIOMS OF PROBABILITY

For events $E, F \subseteq \Omega$:

1. $P(E') = 1 - P(E)$.
2. If $E \subseteq F$, then $P(E) \leq P(F)$.
3. In general, $P(E \cup F) = P(E) + P(F) - P(E \cap F)$.
4. $P(E \cap F') = P(E) - P(E \cap F)$.
5. $P(E \cup F) \leq P(E) + P(F)$.
6. $P(E \cap F) \geq P(E) + P(F) - 1$.

NOTE: The general addition rule 3 for probabilities and Boole's Inequalities 5 and 6 extend to more than two events. Let $E_1, \ldots, E_n$ be events in $\Omega$. Then
$$P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \sum_{i<j<k} P(E_i \cap E_j \cap E_k) - \ldots + (-1)^{n+1} P\left(\bigcap_{i=1}^{n} E_i\right)$$
and
$$P\left(\bigcup_{i=1}^{n} E_i\right) \leq \sum_{i=1}^{n} P(E_i).$$

To prove these results, construct the events $F_1 = E_1$ and
$$F_i = E_i \cap \left(\bigcup_{k=1}^{i-1} E_k\right)', \quad i = 2, 3, \ldots, n.$$
Then $F_1, F_2, \ldots, F_n$ are disjoint, and $\bigcup_{i=1}^{n} E_i = \bigcup_{i=1}^{n} F_i$, so
$$P\left(\bigcup_{i=1}^{n} E_i\right) = P\left(\bigcup_{i=1}^{n} F_i\right) = \sum_{i=1}^{n} P(F_i).$$

Now, by property 4 above,
$$P(F_i) = P(E_i) - P\left(E_i \cap \left(\bigcup_{k=1}^{i-1} E_k\right)\right) = P(E_i) - P\left(\bigcup_{k=1}^{i-1} (E_i \cap E_k)\right), \quad i = 2, 3, \ldots, n,$$
and the result follows by recursive expansion of the second term for $i = 2, 3, \ldots, n$.

NOTE: We will often deal with both probabilities of single events, and also probabilities for intersection events. For convenience, and to reflect connections with distribution theory that will be presented in Chapter 2, we will use the following terminology: for events $E$ and $F$,

$P(E)$ is the marginal probability of $E$;
$P(E \cap F)$ is the joint probability of $E$ and $F$.

1.5 CONDITIONAL PROBABILITY

Definition 1.5.1 For events $E, F \subseteq \Omega$, the conditional probability that $F$ occurs given that $E$ occurs is written $P(F|E)$, and is defined by
$$P(F|E) = \frac{P(E \cap F)}{P(E)},$$
if $P(E) > 0$.

NOTE: $P(E \cap F) = P(E)P(F|E)$, and in general, for events $E_1, \ldots, E_k$,
$$P\left(\bigcap_{i=1}^{k} E_i\right) = P(E_1)P(E_2|E_1)P(E_3|E_1 \cap E_2) \cdots P(E_k|E_1 \cap E_2 \cap \ldots \cap E_{k-1}).$$
This result is known as the CHAIN or MULTIPLICATION RULE.

Definition 1.5.2 Events $E$ and $F$ are independent if $P(E|F) = P(E)$, so that $P(E \cap F) = P(E)P(F)$.

Extension: Events $E_1, \ldots, E_k$ are independent if, for every subset of events of size $l \leq k$, indexed by $\{i_1, \ldots, i_l\}$, say,
$$P\left(\bigcap_{j=1}^{l} E_{i_j}\right) = \prod_{j=1}^{l} P(E_{i_j}).$$
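As a concrete sanity check, the addition rule, conditional probability, and independence can all be verified by brute-force enumeration over a small equally likely sample space. A minimal sketch using two fair dice; the particular events are arbitrary illustrations.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def P(event):
    # Probability of an event (a set of outcomes) under equal likelihood.
    return Fraction(len(event), len(omega))

E = {w for w in omega if w[0] + w[1] == 7}   # total is 7
F = {w for w in omega if w[0] == 3}          # first die shows 3

# Addition rule: P(E u F) = P(E) + P(F) - P(E n F).
assert P(E | F) == P(E) + P(F) - P(E & F)

# Conditional probability: P(F|E) = P(E n F) / P(E).
print("P(F|E) =", P(E & F) / P(E))           # 1/6

# Independence: here P(E n F) = P(E) P(F), so E and F are independent.
assert P(E & F) == P(E) * P(F)
```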

1.6 THE THEOREM OF TOTAL PROBABILITY

THEOREM Let $E_1, \ldots, E_k$ be a (finite) partition of $\Omega$, and let $F \subseteq \Omega$. Then
$$P(F) = \sum_{i=1}^{k} P(F|E_i)P(E_i).$$

PROOF $E_1, \ldots, E_k$ form a partition of $\Omega$, and $F \subseteq \Omega$, so
$$F = (F \cap E_1) \cup \ldots \cup (F \cap E_k) \implies P(F) = \sum_{i=1}^{k} P(F \cap E_i) = \sum_{i=1}^{k} P(F|E_i)P(E_i),$$
writing $F$ as a disjoint union and using the definition of a probability function.

Extension: The theorem still holds if $E_1, E_2, \ldots$ is a (countably) infinite partition of $\Omega$, and $F \subseteq \Omega$, so that
$$P(F) = \sum_{i=1}^{\infty} P(F \cap E_i) = \sum_{i=1}^{\infty} P(F|E_i)P(E_i),$$
if $P(E_i) > 0$ for all $i$.

1.7 BAYES THEOREM

THEOREM Suppose $E, F \subseteq \Omega$, with $P(E), P(F) > 0$. Then
$$P(E|F) = \frac{P(F|E)P(E)}{P(F)}.$$

PROOF $P(E|F)P(F) = P(E \cap F) = P(F|E)P(E)$, so
$$P(E|F)P(F) = P(F|E)P(E).$$

Extension: If $E_1, \ldots, E_k$ are disjoint, with $P(E_i) > 0$ for $i = 1, \ldots, k$, and form a partition of $F \subseteq \Omega$, then
$$P(E_i|F) = \frac{P(F|E_i)P(E_i)}{\sum_{j=1}^{k} P(F|E_j)P(E_j)}.$$

NOTE: in general, $P(E|F) \neq P(F|E)$.
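A short numerical illustration of total probability and Bayes theorem; the diagnostic-test numbers below are invented purely for the example.

```python
from fractions import Fraction

# Hypothetical diagnostic test (all numbers are illustrative only).
# E = "has condition", E' = "does not"; F = "test is positive".
P_E = Fraction(1, 100)              # prevalence P(E)
P_F_given_E = Fraction(95, 100)     # sensitivity P(F|E)
P_F_given_notE = Fraction(5, 100)   # false positive rate P(F|E')

# Theorem of Total Probability over the partition {E, E'}.
P_F = P_F_given_E * P_E + P_F_given_notE * (1 - P_E)

# Bayes theorem: P(E|F) = P(F|E) P(E) / P(F).
P_E_given_F = P_F_given_E * P_E / P_F
print("P(F)   =", P_F)              # 59/1000
print("P(E|F) =", P_E_given_F)      # 19/118, about 0.161

# Note P(E|F) != P(F|E) in general.
assert P_E_given_F != P_F_given_E
```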

1.8 COUNTING TECHNIQUES

This section is included for completeness, but is not examinable.

Suppose that an experiment has $N$ equally likely sample outcomes. If event $E$ corresponds to a collection of sample outcomes of size $n(E)$, then
$$P(E) = \frac{n(E)}{N},$$
so it is necessary to be able to evaluate $n(E)$ and $N$ in practice.

1.8.1 THE MULTIPLICATION PRINCIPLE

If operations labelled $1, \ldots, r$ can be carried out in $n_1, \ldots, n_r$ ways respectively, then there are
$$\prod_{i=1}^{r} n_i = n_1 \cdots n_r$$
ways of carrying out the $r$ operations in total.

Example 1.1 If each of $r$ trials of an experiment has $N$ possible outcomes, then there are $N^r$ possible sequences of outcomes in total. For example:
(i) If a multiple choice exam has 20 questions, each of which has 5 possible answers, then there are $5^{20}$ different ways of completing the exam.
(ii) There are $2^m$ subsets of a set of $m$ elements (as each element is either in the subset, or not in the subset, which is equivalent to $m$ trials each with two outcomes).

1.8.2 SAMPLING FROM A FINITE POPULATION

Consider a collection of $N$ items, and a sequence of operations labelled $1, \ldots, r$ such that the $i$th operation involves selecting one of the items remaining after the first $i-1$ operations have been carried out. Let $n_i$ denote the number of ways of carrying out the $i$th operation, for $i = 1, \ldots, r$. Then there are two distinct cases:

(a) Sampling with replacement: an item is returned to the collection after selection. Then $n_i = N$ for all $i$, and there are $N^r$ ways of carrying out the $r$ operations.

(b) Sampling without replacement: an item is not returned to the collection after selection. Then $n_i = N - i + 1$, and there are $N(N-1)\cdots(N-r+1)$ ways of carrying out the $r$ operations.

e.g. Consider selecting 5 cards from 52. Then (a) leads to $52^5$ possible selections, whereas (b) leads to $52 \cdot 51 \cdot 50 \cdot 49 \cdot 48$ possible selections.

NOTE: The order in which the operations are carried out may be important: e.g. in a raffle with three prizes and 100 tickets, the draw (45, 19, 76) is different from (19, 76, 45).
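The with/without replacement counts are easy to confirm by direct enumeration for small numbers; a minimal sketch using only the standard library:

```python
from itertools import permutations, product
from math import perm

N, r = 6, 3  # small enough to enumerate exhaustively

# Sampling with replacement: ordered r-tuples from N items.
with_repl = list(product(range(N), repeat=r))
assert len(with_repl) == N ** r                      # 6^3 = 216

# Sampling without replacement: ordered r-tuples of distinct items.
without_repl = list(permutations(range(N), r))
assert len(without_repl) == N * (N - 1) * (N - 2)    # 6*5*4 = 120
assert len(without_repl) == perm(N, r)               # math.perm = N!/(N-r)!

print(len(with_repl), len(without_repl))             # 216 120
```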

NOTE: The items may be distinct (unique in the collection), or indistinct (of a unique type in the collection, but not unique individually). e.g. The numbered balls in the National Lottery, or individual playing cards, are distinct. However, when balls in the lottery are regarded as WINNING or NOT WINNING, or playing cards are regarded in terms of their suit only, they are indistinct.

1.8.3 PERMUTATIONS AND COMBINATIONS

Definition 1.8.1 A permutation is an ordered arrangement of a set of items. A combination is an unordered arrangement of a set of items.

RESULT 1 The number of permutations of $n$ distinct items is $n! = n(n-1)\cdots 1$.

RESULT 2 The number of permutations of $r$ from $n$ distinct items is
$$P^n_r = \frac{n!}{(n-r)!} = n(n-1)\cdots(n-r+1)$$
(by the Multiplication Principle).

If the order in which items are selected is not important, then:

RESULT 3 The number of combinations of $r$ from $n$ distinct items is the binomial coefficient
$$C^n_r = \binom{n}{r} = \frac{n!}{r!(n-r)!} \qquad (\text{as } P^n_r = r! \, C^n_r).$$

Recall the Binomial Theorem, namely
$$(a+b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}.$$
Then the number of subsets of $m$ items can be calculated as follows: for each $0 \leq j \leq m$, choose a subset of $j$ items from $m$. Then
$$\text{Total number of subsets} = \sum_{j=0}^{m} \binom{m}{j} = (1+1)^m = 2^m.$$

If the items are indistinct, but each is of a unique type, say Type I, ..., Type $\kappa$ (the so-called Urn Model), then:

RESULT 4 The number of distinguishable permutations of $n$ indistinct objects, comprising $n_i$ items of type $i$ for $i = 1, \ldots, \kappa$, is
$$\frac{n!}{n_1! n_2! \cdots n_\kappa!}.$$

Special Case: if $\kappa = 2$, then the number of distinguishable permutations of the $n_1$ objects of Type I and $n_2 = n - n_1$ objects of Type II is
$$C^n_{n_1} = C^n_{n_2} = \frac{n!}{n_1!(n-n_1)!}.$$
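These counting results map directly onto the standard library; a quick sketch confirming Results 2-4 for small values:

```python
from itertools import permutations
from math import comb, factorial, perm

n, r = 10, 4

# Result 2: permutations of r from n.
assert perm(n, r) == factorial(n) // factorial(n - r)

# Result 3: combinations, with P^n_r = r! * C^n_r.
assert perm(n, r) == factorial(r) * comb(n, r)

# Subset count: sum_j C(m, j) = 2^m.
m = 8
assert sum(comb(m, j) for j in range(m + 1)) == 2 ** m

# Result 4 (two types): distinguishable orderings of 2 'I's and 3 'T's,
# counted directly by collecting distinct permutations in a set.
words = set(permutations("IITTT"))
assert len(words) == factorial(5) // (factorial(2) * factorial(3))  # 10
print("counting identities verified")
```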

RESULT 5 There are $C^n_r$ ways of partitioning $n$ distinct items into two cells, with $r$ in one cell and $n-r$ in the other.

1.8.4 PROBABILITY CALCULATIONS

Recall that if an experiment has $N$ equally likely sample outcomes, and event $E$ corresponds to a collection of sample outcomes of size $n(E)$, then
$$P(E) = \frac{n(E)}{N}.$$

Example 1.2 A True/False exam has 20 questions. Let $E$ = "16 answers correct at random". Then
$$P(E) = \frac{\text{Number of ways of getting 16 out of 20 correct}}{\text{Total number of ways of answering 20 questions}} = \frac{\binom{20}{16}}{2^{20}} = 0.0046.$$

Example 1.3 Sampling without replacement. Consider an Urn Model with 10 Type I objects and 20 Type II objects, and an experiment involving sampling five objects without replacement. Let $E$ = "precisely 2 Type I objects selected". We need to calculate $N$ and $n(E)$ in order to calculate $P(E)$. In this case $N$ is the number of ways of choosing 5 from 30 items, and hence
$$N = \binom{30}{5}.$$
To calculate $n(E)$, we think of $E$ occurring by first choosing 2 Type I objects from 10, and then choosing 3 Type II objects from 20, and hence, by the multiplication rule,
$$n(E) = \binom{10}{2}\binom{20}{3}.$$
Therefore
$$P(E) = \frac{\binom{10}{2}\binom{20}{3}}{\binom{30}{5}} = 0.360.$$

This result can be checked using a conditional probability argument; consider event $F \subseteq E$, where $F$ = "sequence of objects 11222 obtained". Then $F = \bigcap_{i=1}^{5} F_{ij}$, where $F_{ij}$ = "type $j$ object obtained on draw $i$", $i = 1, \ldots, 5$, $j = 1, 2$. Then
$$P(F) = P(F_{11})P(F_{21}|F_{11}) \cdots P(F_{52}|F_{11}, F_{21}, F_{32}, F_{42}) = \frac{10}{30}\cdot\frac{9}{29}\cdot\frac{20}{28}\cdot\frac{19}{27}\cdot\frac{18}{26}.$$

Now consider event $G$, where $G$ = "sequence of objects 12122 obtained". Then
$$P(G) = \frac{10}{30}\cdot\frac{20}{29}\cdot\frac{9}{28}\cdot\frac{19}{27}\cdot\frac{18}{26},$$
i.e. $P(G) = P(F)$. In fact, any sequence containing two Type I and three Type II objects has this probability, and there are $\binom{5}{2}$ such sequences. Thus, as all such sequences are mutually exclusive,
$$P(E) = \binom{5}{2}\,\frac{10}{30}\cdot\frac{9}{29}\cdot\frac{20}{28}\cdot\frac{19}{27}\cdot\frac{18}{26} = \frac{\binom{10}{2}\binom{20}{3}}{\binom{30}{5}},$$
as before.

Example 1.4 Sampling with replacement. Consider an Urn Model with 10 Type I objects and 20 Type II objects, and an experiment involving sampling five objects with replacement. Let $E$ = "precisely 2 Type I objects selected". Again, we need to calculate $N$ and $n(E)$ in order to calculate $P(E)$. In this case $N$ is the number of ways of choosing 5 from 30 items with replacement, and hence $N = 30^5$. To calculate $n(E)$, we think of $E$ occurring by first choosing 2 Type I objects from 10, and 3 Type II objects from 20, in any order. Consider such sequences of selection:

Sequence 11222: number of ways $10 \cdot 10 \cdot 20 \cdot 20 \cdot 20$
Sequence 12122: number of ways $10 \cdot 20 \cdot 10 \cdot 20 \cdot 20$

etc., and thus a sequence with 2 Type I objects and 3 Type II objects can be obtained in $10^2 \, 20^3$ ways. As before there are $\binom{5}{2}$ such sequences, and thus
$$P(E) = \frac{\binom{5}{2}\, 10^2\, 20^3}{30^5} = 0.329.$$

Again, this result can be verified using a conditional probability argument; consider event $F \subseteq E$, where $F$ = "sequence of objects 11222 obtained". Then
$$P(F) = \left(\frac{10}{30}\right)^2 \left(\frac{20}{30}\right)^3,$$
as the results of the draws are independent. This result is true for any sequence containing two Type I and three Type II objects, and there are $\binom{5}{2}$ such sequences that are mutually exclusive, so
$$P(E) = \binom{5}{2} \left(\frac{10}{30}\right)^2 \left(\frac{20}{30}\right)^3.$$
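Both urn examples can be checked numerically; the sketch below reproduces the 0.360 (without replacement) and 0.329 (with replacement) values and confirms the ordered-sequence argument.

```python
from math import comb

# Example 1.3: without replacement (hypergeometric form).
p_without = comb(10, 2) * comb(20, 3) / comb(30, 5)

# Same value via the ordered-sequence argument: C(5,2) sequences,
# each with probability (10/30)(9/29)(20/28)(19/27)(18/26).
seq = (10/30) * (9/29) * (20/28) * (19/27) * (18/26)
assert abs(p_without - comb(5, 2) * seq) < 1e-12
print(round(p_without, 3))   # 0.36

# Example 1.4: with replacement (binomial form).
p_with = comb(5, 2) * (10/30)**2 * (20/30)**3
print(round(p_with, 3))      # 0.329
```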


CHAPTER 2

RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS

This chapter introduces random variables as a technical device enabling the general specification of probability distributions in one and many dimensions. The key topics and techniques introduced in this chapter include the following:

EXPECTATION
TRANSFORMATION
STANDARDIZATION
GENERATING FUNCTIONS
JOINT MODELLING
MARGINALIZATION
MULTIVARIATE TRANSFORMATION
MULTIVARIATE EXPECTATION & COVARIANCE
SUMS OF VARIABLES

Of key importance is the moment generating function, which is a standard device for identification of probability distributions. Transformations are often used to transform a random variable or statistic of interest to one of simpler form, whose probability distribution is more convenient to work with. Standardization is often a key part of such simplification.

2.1 RANDOM VARIABLES & PROBABILITY MODELS

We are not always interested in an experiment itself, but rather in some consequence of its random outcome. Such consequences, when real valued, may be thought of as functions which map $\Omega$ to $\mathbb{R}$, and these functions are called random variables.

Definition 2.1.1 A random variable (r.v.) $X$ is a function $X : \Omega \to \mathbb{R}$ with the property that $\{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{A}$ for each $x \in \mathbb{R}$.

The point is that we defined the probability function $P(.)$ on the σ-field $\mathcal{A}$, so if $A(x) = \{\omega \in \Omega : X(\omega) \leq x\}$, we cannot discuss $P(A(x))$ unless $A(x)$ belongs to $\mathcal{A}$. We generally pay no attention to the technical condition in the definition, and just think of random variables as functions mapping $\Omega$ to $\mathbb{R}$.

So, we regard a set $B \subseteq \mathbb{R}$ as an event, associated with event $A \subseteq \Omega$ if $A = \{\omega : X(\omega) = x \text{ for some } x \in B\}$. $A$ and $B$ are events in different spaces, but are equivalent in the sense that
$$P(X \in B) = P(A),$$
where, formally, it is the latter quantity that is defined by the probability function. Attention switches to assigning the probability $P(X \in B)$ for appropriate sets $B \subseteq \mathbb{R}$.

If $\Omega$ is a list of discrete elements $\Omega = \{\omega_1, \omega_2, \ldots\}$, then the definition indicates that the events of interest will be of the form $[X = b]$, or equivalently of the form $[X \leq b]$, for $b \in \mathbb{R}$. For more general sample spaces, we will concentrate on events of the form $[X \leq b]$ for $b \in \mathbb{R}$.

2.2 DISCRETE RANDOM VARIABLES

Definition 2.2.1 A random variable $X$ is discrete if the set of all possible values of $X$ (that is, the range of the function represented by $X$), denoted $\mathbb{X}$, is countable, that is,
$$\mathbb{X} = \{x_1, x_2, \ldots, x_n\} \ [\text{FINITE}] \quad \text{or} \quad \mathbb{X} = \{x_1, x_2, \ldots\} \ [\text{INFINITE}].$$

Definition 2.2.2 PROBABILITY MASS FUNCTION The function $f_X$ defined on $\mathbb{X}$ by
$$f_X(x) = P[X = x], \quad x \in \mathbb{X},$$
that assigns probability to each $x \in \mathbb{X}$, is the (discrete) probability mass function, or pmf.

NOTE: For completeness, we define $f_X(x) = 0$ for $x \notin \mathbb{X}$, so that $f_X$ is defined for all $x \in \mathbb{R}$. Furthermore, we will refer to $\mathbb{X}$ as the support of random variable $X$, that is, the set of $x \in \mathbb{R}$ such that $f_X(x) > 0$.

2.2.1 PROPERTIES OF MASS FUNCTION $f_X$

Elementary properties of the mass function are straightforward to establish using properties of the probability function. A function $f_X$ is a probability mass function for discrete random variable $X$ with range $\mathbb{X}$ of the form $\{x_1, x_2, \ldots\}$ if and only if
(i) $f_X(x_i) \geq 0$ for all $i$;
(ii) $\sum_i f_X(x_i) = 1$.

These results follow as the events $[X = x_1], [X = x_2]$, etc. are equivalent to events that partition $\Omega$; that is, $[X = x_i]$ is equivalent to an event $A_i$, hence $P[X = x_i] = P(A_i)$, and the two parts of the theorem follow immediately.

Definition 2.2.3 DISCRETE CUMULATIVE DISTRIBUTION FUNCTION The cumulative distribution function, or cdf, $F_X$ of a discrete r.v. $X$ is defined by
$$F_X(x) = P[X \leq x], \quad x \in \mathbb{R}.$$

2.2.2 CONNECTION BETWEEN $F_X$ AND $f_X$

Let $X$ be a discrete random variable with range $\mathbb{X} = \{x_1, x_2, \ldots\}$, where $x_1 < x_2 < \ldots$, and probability mass function $f_X$ and cdf $F_X$. Then for any real value $x$, if $x < x_1$, then $F_X(x) = 0$, and for $x \geq x_1$,
$$F_X(x) = \sum_{x_i \leq x} f_X(x_i), \qquad f_X(x_i) = F_X(x_i) - F_X(x_{i-1}), \quad i = 2, 3, \ldots,$$
with, for completeness, $f_X(x_1) = F_X(x_1)$.

These relationships follow as events of the form $[X \leq x_i]$ can be represented as countable unions of the events $A_i$. The first result therefore follows from properties of the probability function. The second result follows immediately.

2.2.3 PROPERTIES OF DISCRETE CDF $F_X$

(i) In the limiting cases,
$$\lim_{x \to -\infty} F_X(x) = 0, \qquad \lim_{x \to \infty} F_X(x) = 1.$$
(ii) $F_X$ is continuous from the right (but not continuous) on $\mathbb{R}$; that is, for $x \in \mathbb{R}$,
$$\lim_{h \to 0^+} F_X(x + h) = F_X(x).$$
(iii) $F_X$ is non-decreasing, that is,
$$a < b \implies F_X(a) \leq F_X(b).$$
(iv) For $a < b$,
$$P[a < X \leq b] = F_X(b) - F_X(a).$$

The key idea is that the functions $f_X$ and/or $F_X$ can be used to describe the probability distribution of the random variable $X$. A graph of the function $f_X$ is non-zero only at the elements of $\mathbb{X}$. A graph of the function $F_X$ is a step-function which takes the value zero at minus infinity, the value one at infinity, and is non-decreasing with points of discontinuity at the elements of $\mathbb{X}$.

2.3 CONTINUOUS RANDOM VARIABLES

Definition 2.3.1 A random variable $X$ is continuous if the function $F_X$ defined on $\mathbb{R}$ by $F_X(x) = P[X \leq x]$ for $x \in \mathbb{R}$ is a continuous function on $\mathbb{R}$; that is, for $x \in \mathbb{R}$,
$$\lim_{h \to 0} F_X(x + h) = F_X(x).$$

Definition 2.3.2 CONTINUOUS CUMULATIVE DISTRIBUTION FUNCTION The cumulative distribution function, or cdf, $F_X$ of a continuous r.v. $X$ is defined by
$$F_X(x) = P[X \leq x], \quad x \in \mathbb{R}.$$

Definition 2.3.3 PROBABILITY DENSITY FUNCTION A random variable is absolutely continuous if the cumulative distribution function $F_X$ can be written
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$
for some function $f_X$, termed the probability density function, or pdf, of $X$. From now on, when we speak of a continuous random variable, we will implicitly assume the absolutely continuous case, where a pdf exists.

2.3.1 PROPERTIES OF CONTINUOUS $F_X$ AND $f_X$

By analogy with the discrete case, let $\mathbb{X}$ be the range of $X$, so that $\mathbb{X} = \{x : f_X(x) > 0\}$.

(i) The pdf $f_X$ need not exist, but as indicated above, continuous r.v.'s where a pdf $f_X$ cannot be defined in this way will be ignored. The function $f_X$ can be defined piecewise on intervals of $\mathbb{R}$.

(ii) For the cdf of a continuous r.v.,
$$\lim_{x \to -\infty} F_X(x) = 0, \qquad \lim_{x \to \infty} F_X(x) = 1.$$

(iii) Directly from the definition, at values of $x$ where $F_X$ is differentiable,
$$f_X(x) = \frac{d}{dt}\{F_X(t)\}_{t=x}.$$

(iv) If $X$ is continuous,
$$P[X = x] = \lim_{h \to 0^+}\left[P(X \leq x) - P(X \leq x - h)\right] = \lim_{h \to 0^+}\left[F_X(x) - F_X(x - h)\right] = 0.$$

(v) For $a < b$,
$$P[a < X \leq b] = P[a \leq X < b] = P[a \leq X \leq b] = P[a < X < b] = F_X(b) - F_X(a).$$

It follows that a function $f_X$ is a pdf for a continuous random variable $X$ if and only if
(i) $f_X(x) \geq 0$ for all $x$;
(ii) $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
This result follows directly from definitions and properties of $F_X$.

Example 2.1 Consider a coin tossing experiment where a fair coin is tossed repeatedly under identical experimental conditions, with the sequence of tosses independent, until a Head is obtained. For this experiment, the sample space $\Omega$ is the set of sequences $(\{H\}, \{TH\}, \{TTH\}, \{TTTH\}, \ldots)$ with associated probabilities $1/2, 1/4, 1/8, 1/16, \ldots$. Define the discrete random variable $X : \Omega \to \mathbb{R}$ by $X(\omega) = x$ if and only if the first $H$ occurs on toss $x$. Then
$$f_X(x) = P[X = x] = \left(\frac{1}{2}\right)^x, \quad x = 1, 2, 3, \ldots,$$

and zero otherwise. For $x \geq 1$, let $k(x)$ be the largest integer not greater than $x$; then
$$F_X(x) = \sum_{x_i \leq x} f_X(x_i) = \sum_{i=1}^{k(x)} f_X(i) = 1 - \left(\frac{1}{2}\right)^{k(x)},$$
and $F_X(x) = 0$ for $x < 1$.

Graphs of the probability mass function (left) and cumulative distribution function (right) are shown in Figure 2.1. Note that the mass function is only non-zero at points that are elements of $\mathbb{X}$, and that the cdf is defined for all real values of $x$, but is only continuous from the right. $F_X$ is therefore a step-function.

[Figure 2.1: PMF $f_X(x) = (1/2)^x$, $x = 1, 2, \ldots$, and CDF $F_X(x) = 1 - (1/2)^{k(x)}$.]
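A direct numeric rendering of Example 2.1 (a geometric distribution on 1, 2, 3, ...), checking the pmf/cdf connection of Section 2.2.2; a minimal sketch in plain Python, without the plots:

```python
from math import floor

def f(x):
    # pmf of Example 2.1: P[X = x] = (1/2)^x for x = 1, 2, 3, ...
    return 0.5 ** x if isinstance(x, int) and x >= 1 else 0.0

def F(x):
    # cdf: F(x) = 1 - (1/2)^k(x), k(x) = largest integer <= x, for x >= 1.
    return 1 - 0.5 ** floor(x) if x >= 1 else 0.0

# Connection between F and f: f(x_i) = F(x_i) - F(x_{i-1}).
for i in range(2, 12):
    assert abs(f(i) - (F(i) - F(i - 1))) < 1e-12
assert f(1) == F(1)

# The cdf is a right-continuous step function: flat between integers.
assert F(3.7) == F(3) and F(3.7) < F(4)
print(F(10))   # 0.9990234375, approaching 1
```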

Example 2.2 Consider an experiment to measure the length of time that an electrical component functions before failure. The sample space of outcomes of the experiment, $\Omega$, is $\mathbb{R}^+$, and if $A_x$ is the event that the component functions for longer than $x > 0$ time units, suppose that $P(A_x) = \exp\{-x^2\}$. Define the continuous random variable $X : \Omega \to \mathbb{R}^+$ by $X(\omega) = x$ if and only if the component fails at time $x$. Then, if $x > 0$,
$$F_X(x) = P[X \leq x] = 1 - P(A_x) = 1 - \exp\{-x^2\},$$
and $F_X(x) = 0$ if $x \leq 0$. Hence, if $x > 0$,
$$f_X(x) = \frac{d}{dt}\{F_X(t)\}_{t=x} = 2x \exp\{-x^2\},$$
and zero otherwise.

Graphs of the probability density function (left) and cumulative distribution function (right) are shown in Figure 2.2. Note that both the pdf and cdf are defined for all real values of $x$, and that both are continuous functions.

[Figure 2.2: PDF $f_X(x) = 2x\exp\{-x^2\}$, $x > 0$, and CDF $F_X(x) = 1 - \exp\{-x^2\}$, $x > 0$.]

Note that here
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{0}^{x} f_X(t)\,dt,$$
as $f_X(x) = 0$ for $x \leq 0$, and also that
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1.$$
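The claims at the end of Example 2.2 are easy to confirm numerically; a minimal sketch using only the standard library and simple midpoint-rule quadrature:

```python
from math import exp

def f(x):
    # pdf of Example 2.2.
    return 2 * x * exp(-x * x) if x > 0 else 0.0

def F(x):
    # cdf of Example 2.2.
    return 1 - exp(-x * x) if x > 0 else 0.0

def integrate(g, a, b, n=100_000):
    # Simple midpoint rule, adequate for this smooth integrand.
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# Total probability is 1 (the tail beyond x = 6 is of order e^{-36}).
assert abs(integrate(f, 0.0, 6.0) - 1.0) < 1e-6

# F(x) equals the integral of f from 0 to x.
for x in (0.5, 1.0, 2.0):
    assert abs(integrate(f, 0.0, x) - F(x)) < 1e-6
print("pdf/cdf checks passed")
```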

2.4 EXPECTATIONS AND THEIR PROPERTIES

Definition 2.4.1 For a discrete random variable $X$ with range $\mathbb{X}$ and with probability mass function $f_X$, the expectation or expected value of $X$ with respect to $f_X$ is defined by
$$E_{f_X}[X] = \sum_{x=-\infty}^{\infty} x f_X(x) = \sum_{x \in \mathbb{X}} x f_X(x).$$
For a continuous random variable $X$ with range $\mathbb{X}$ and pdf $f_X$, the expectation or expected value of $X$ with respect to $f_X$ is defined by
$$E_{f_X}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{\mathbb{X}} x f_X(x)\,dx.$$

NOTE: The sum/integral may not be convergent, and hence the expected value may be infinite. It is important always to check that the sum/integral is finite: a sufficient condition is the absolute convergence of the summand/integrand, that is,
$$\sum_{x} |x| f_X(x) < \infty \implies \sum_{x} x f_X(x) = E_{f_X}[X] < \infty,$$
or, in the continuous case,
$$\int_{-\infty}^{\infty} |x| f_X(x)\,dx < \infty \implies \int_{-\infty}^{\infty} x f_X(x)\,dx = E_{f_X}[X] < \infty.$$

Extension: Let $g$ be a real-valued function whose domain includes $\mathbb{X}$. Then
$$E_{f_X}[g(X)] = \begin{cases} \displaystyle\sum_{x \in \mathbb{X}} g(x) f_X(x), & \text{if } X \text{ is discrete}, \\ \displaystyle\int_{\mathbb{X}} g(x) f_X(x)\,dx, & \text{if } X \text{ is continuous}. \end{cases}$$

PROPERTIES OF EXPECTATIONS

Let $X$ be a random variable with mass function/pdf $f_X$. Let $g$ and $h$ be real-valued functions whose domains include $\mathbb{X}$, and let $a$ and $b$ be constants. Then
$$E_{f_X}[ag(X) + bh(X)] = a E_{f_X}[g(X)] + b E_{f_X}[h(X)],$$
as (in the continuous case)
$$E_{f_X}[ag(X) + bh(X)] = \int_{\mathbb{X}} [ag(x) + bh(x)] f_X(x)\,dx = a\int_{\mathbb{X}} g(x) f_X(x)\,dx + b\int_{\mathbb{X}} h(x) f_X(x)\,dx = a E_{f_X}[g(X)] + b E_{f_X}[h(X)].$$

SPECIAL CASES:

(i) For a simple linear function,
$$E_{f_X}[aX + b] = a E_{f_X}[X] + b.$$

(ii) Consider $g(x) = (x - E_{f_X}[X])^2$. Write $\mu = E_{f_X}[X]$ (a constant that does not depend on $x$). Then, expanding the integrand,
$$E_{f_X}[g(X)] = \int (x - \mu)^2 f_X(x)\,dx = \int x^2 f_X(x)\,dx - 2\mu \int x f_X(x)\,dx + \mu^2 \int f_X(x)\,dx = \int x^2 f_X(x)\,dx - \mu^2 = E_{f_X}[X^2] - \{E_{f_X}[X]\}^2.$$
Then
$$\mathrm{Var}_{f_X}[X] = E_{f_X}[X^2] - \{E_{f_X}[X]\}^2$$
is the variance of the distribution. Similarly, $\sqrt{\mathrm{Var}_{f_X}[X]}$ is the standard deviation of the distribution.

(iii) Consider $g(x) = x^k$ for $k = 1, 2, \ldots$. Then, in the continuous case,
$$E_{f_X}[g(X)] = E_{f_X}[X^k] = \int_{\mathbb{X}} x^k f_X(x)\,dx,$$
and $E_{f_X}[X^k]$ is the $k$th moment of the distribution.

(iv) Consider $g(x) = (x - \mu)^k$ for $k = 1, 2, \ldots$. Then
$$E_{f_X}[g(X)] = E_{f_X}[(X - \mu)^k] = \int_{\mathbb{X}} (x - \mu)^k f_X(x)\,dx,$$
and $E_{f_X}[(X - \mu)^k]$ is the $k$th central moment of the distribution.

(v) Consider $g(x) = ax + b$. Then
$$\mathrm{Var}_{f_X}[aX + b] = E_{f_X}[(aX + b - E_{f_X}[aX + b])^2] = E_{f_X}[(aX + b - aE_{f_X}[X] - b)^2] = E_{f_X}[a^2(X - E_{f_X}[X])^2] = a^2 \mathrm{Var}_{f_X}[X].$$
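A numerical check of these identities, using the geometric pmf of Example 2.1 and truncating the infinite sum where the tail is negligible:

```python
# pmf of Example 2.1: f(x) = (1/2)^x on x = 1, 2, 3, ...
xs = range(1, 200)                     # tail beyond 200 is ~ 2^-200
f = {x: 0.5 ** x for x in xs}

E_X  = sum(x * f[x] for x in xs)       # first moment, should be 2
E_X2 = sum(x * x * f[x] for x in xs)   # second moment, should be 6
var  = E_X2 - E_X ** 2                 # Var[X] = E[X^2] - (E[X])^2 = 2
assert abs(E_X - 2) < 1e-12 and abs(var - 2) < 1e-12

# Special cases (i) and (v) with a = 3, b = 1.
a, b = 3, 1
E_aXb = sum((a * x + b) * f[x] for x in xs)
V_aXb = sum((a * x + b - E_aXb) ** 2 * f[x] for x in xs)
assert abs(E_aXb - (a * E_X + b)) < 1e-10
assert abs(V_aXb - a * a * var) < 1e-8
print(E_X, var)   # 2.0 2.0 (approximately)
```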

2.5 INDICATOR VARIABLES

A particular class of random variables called indicator variables are particularly useful. Let $A$ be an event and let $I_A : \Omega \to \mathbb{R}$ be the indicator function of $A$, so that
$$I_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A, \\ 0, & \text{if } \omega \notin A. \end{cases}$$
Then $I_A$ is a random variable taking values 1 and 0 with probabilities $P(A)$ and $P(A')$ respectively. Also, $I_A$ has expectation $P(A)$ and variance $P(A)\{1 - P(A)\}$. The usefulness lies in the fact that any discrete random variable $X$ can be written as a linear combination of indicator random variables:
$$X = \sum_{i} a_i I_{A_i},$$
for some collection of events $(A_i, i \geq 1)$ and real numbers $(a_i, i \geq 1)$. Sometimes we can obtain the expectation and variance of a random variable $X$ easily by expressing it in this way, then using knowledge of the expectation and variance of the indicator variables $I_{A_i}$, rather than by direct calculation.

2.6 TRANSFORMATIONS OF RANDOM VARIABLES

2.6.1 GENERAL TRANSFORMATIONS

Consider a discrete/continuous r.v. $X$ with range $\mathbb{X}$ and probability distribution described by mass/pdf $f_X$, or cdf $F_X$. Suppose $g$ is a real-valued function defined on $\mathbb{X}$. Then $Y = g(X)$ is also an r.v. ($Y$ is also a function from $\Omega$ to $\mathbb{R}$). Denote the range of $Y$ by $\mathbb{Y}$. For $A \subseteq \mathbb{R}$, the event $[Y \in A]$ is an event in terms of the transformed variable $Y$. If $f_Y$ is the mass/density function for $Y$, then
$$P[Y \in A] = \begin{cases} \displaystyle\sum_{y \in A} f_Y(y), & Y \text{ discrete}, \\ \displaystyle\int_{A} f_Y(y)\,dy, & Y \text{ continuous}. \end{cases}$$

We wish to derive the probability distribution of random variable $Y$; in order to do this, we first consider the inverse transformation $g^{-1}$ from $\mathbb{Y}$ to $\mathbb{X}$, defined for set $A \subseteq \mathbb{Y}$ (and for $y \in \mathbb{Y}$) by
$$g^{-1}(A) = \{x \in \mathbb{X} : g(x) \in A\}, \qquad g^{-1}(y) = \{x \in \mathbb{X} : g(x) = y\},$$
that is, $g^{-1}(A)$ is the set of points in $\mathbb{X}$ that map into $A$, and $g^{-1}(y)$ is the set of points in $\mathbb{X}$ that map to $y$, under transformation $g$. By construction, we have
$$P[Y \in A] = P[X \in g^{-1}(A)].$$
Then, for $y \in \mathbb{R}$, we have
$$F_Y(y) = P[Y \leq y] = P[g(X) \leq y] = \begin{cases} \displaystyle\sum_{x \in A_y} f_X(x), & X \text{ discrete}, \\ \displaystyle\int_{A_y} f_X(x)\,dx, & X \text{ continuous}, \end{cases}$$
where $A_y = \{x \in \mathbb{X} : g(x) \leq y\}$.

This result gives the first principles approach to computing the distribution of the new variable. The approach can be summarized as follows: consider the range $\mathbb{Y}$ of the new variable; then consider the cdf $F_Y(y)$. Step through the argument as follows:
$$F_Y(y) = P[Y \leq y] = P[g(X) \leq y] = P[X \in A_y].$$

Note that it is usually a good idea to start with the cdf, not the pmf or pdf. Our main objective is therefore to identify the set $A_y = \{x \in \mathbb{X} : g(x) \leq y\}$.

Example 2.3 Suppose that $X$ is a continuous r.v. with range $\mathbb{X} = (0, 2\pi)$ whose pdf $f_X$ is constant,
$$f_X(x) = \frac{1}{2\pi}, \quad 0 < x < 2\pi,$$
and zero otherwise. This pdf has corresponding continuous cdf
$$F_X(x) = \frac{x}{2\pi}, \quad 0 < x < 2\pi.$$
Consider the transformed r.v. $Y = \sin X$. Then the range of $Y$, $\mathbb{Y}$, is $[-1, 1]$, but the transformation is not 1-1. However, from first principles, we have
$$F_Y(y) = P[Y \leq y] = P[\sin X \leq y].$$

[Figure 2.3: Computation of $A_y$ for $Y = \sin X$.]

Now, by inspection of Figure 2.3, we can easily identify the required set $A_y$, for $y > 0$: it is the union of two disjoint intervals,
$$A_y = [0, x_1] \cup [x_2, 2\pi] = \left[0, \sin^{-1} y\right] \cup \left[\pi - \sin^{-1} y, \, 2\pi\right].$$
Hence

$$F_Y(y) = P[\sin X \leq y] = P[X \leq x_1] + P[X \geq x_2] = \{P[X \leq x_1]\} + \{1 - P[X < x_2]\} = \frac{1}{2\pi}\sin^{-1} y + 1 - \frac{1}{2\pi}\left(\pi - \sin^{-1} y\right) = \frac{1}{2} + \frac{1}{\pi}\sin^{-1} y,$$
and hence, by differentiation,
$$f_Y(y) = \frac{1}{\pi} \frac{1}{\sqrt{1 - y^2}}.$$
[A symmetry argument verifies this for $y < 0$.]

Example 2.4 Consider the transformed r.v. $T = \tan X$. Then the range of $T$, $\mathbb{T}$, is $\mathbb{R}$, but the transformation is not 1-1. However, from first principles, we have, for $t > 0$,
$$F_T(t) = P[T \leq t] = P[\tan X \leq t].$$

[Figure 2.4: Computation of $A_t$ for $T = \tan X$.]

Figure 2.4 helps identify the required set $A_t$: in this case it is the union of three disjoint intervals,
$$A_t = [0, x_1] \cup \left(\frac{\pi}{2}, x_2\right] \cup \left(\frac{3\pi}{2}, 2\pi\right] = \left[0, \tan^{-1} t\right] \cup \left(\frac{\pi}{2}, \pi + \tan^{-1} t\right] \cup \left(\frac{3\pi}{2}, 2\pi\right]$$
(note, for values of $t < 0$, the union will be of only two intervals, but the calculation proceeds identically). Therefore,
$$F_T(t) = P[\tan X \leq t] = P[X \leq x_1] + P\left[\frac{\pi}{2} < X \leq x_2\right] + P\left[\frac{3\pi}{2} < X \leq 2\pi\right] = \frac{1}{2\pi}\tan^{-1} t + \frac{1}{2\pi}\left\{\pi + \tan^{-1} t - \frac{\pi}{2}\right\} + \frac{1}{2\pi}\left\{2\pi - \frac{3\pi}{2}\right\} = \frac{1}{\pi}\tan^{-1} t + \frac{1}{2},$$

and hence, by differentiation,
$$f_T(t) = \frac{1}{\pi} \frac{1}{1 + t^2}.$$
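Both densities from Examples 2.3 and 2.4 can be checked by simulation: draw X uniformly on (0, 2π), transform, and compare the empirical cdf with the derived formulas. A minimal sketch (Monte Carlo, so agreement is only approximate):

```python
import math
import random

random.seed(0)
n = 200_000
xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]

def ecdf(sample, v):
    # Empirical cdf: proportion of the sample <= v.
    return sum(s <= v for s in sample) / len(sample)

ys = [math.sin(x) for x in xs]
ts = [math.tan(x) for x in xs]

# Example 2.3: F_Y(y) = 1/2 + arcsin(y)/pi.
for y in (-0.8, -0.3, 0.2, 0.7):
    assert abs(ecdf(ys, y) - (0.5 + math.asin(y) / math.pi)) < 0.01

# Example 2.4: F_T(t) = 1/2 + arctan(t)/pi (the Cauchy distribution).
for t in (-5.0, -1.0, 0.5, 3.0):
    assert abs(ecdf(ts, t) - (0.5 + math.atan(t) / math.pi)) < 0.01

print("empirical cdfs match the derived formulas")
```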

2.6.2 1-1 TRANSFORMATIONS

The mapping $g(x)$, a function of $X$ from $\mathbb{X}$, is 1-1 and onto $\mathbb{Y}$ if for each $y \in \mathbb{Y}$ there exists one and only one $x \in \mathbb{X}$ such that $y = g(x)$. The following theorem gives the distribution for random variable $Y = g(X)$ when $g$ is 1-1.

Theorem 2.6.1 THE UNIVARIATE TRANSFORMATION THEOREM Let $X$ be a random variable with mass/density function $f_X$ and support $\mathbb{X}$. Let $g$ be a 1-1 function from $\mathbb{X}$ onto $\mathbb{Y}$ with inverse $g^{-1}$. Then $Y = g(X)$ is a random variable with support $\mathbb{Y}$, and:

Discrete Case: The mass function of random variable $Y$ is given by
$$f_Y(y) = f_X(g^{-1}(y)), \quad y \in \mathbb{Y} = \{y : f_Y(y) > 0\},$$
where $x = g^{-1}(y)$ is the unique solution of $y = g(x)$.

Continuous Case: The pdf of random variable $Y$ is given by
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dt}\{g^{-1}(t)\}_{t=y} \right|, \quad y \in \mathbb{Y} = \{y : f_Y(y) > 0\},$$
where $y = g(x)$, provided that the derivative $\frac{d}{dt}\{g^{-1}(t)\}$ is continuous and non-zero on $\mathbb{Y}$.

Proof. Discrete case: by direct calculation,
$$f_Y(y) = P[Y = y] = P[g(X) = y] = P[X = g^{-1}(y)] = f_X(x),$$
where $x = g^{-1}(y)$, and hence $f_Y(y) > 0 \iff f_X(x) > 0$.

Continuous case: the function $g$ is either (I) monotonic increasing, or (II) monotonic decreasing.

Case (I): If $g$ is increasing, then for $x \in \mathbb{X}$ and $y \in \mathbb{Y}$, we have that
$$g(x) \leq y \iff x \leq g^{-1}(y).$$
Therefore, for $y \in \mathbb{Y}$,
$$F_Y(y) = P[Y \leq y] = P[g(X) \leq y] = P[X \leq g^{-1}(y)] = F_X(g^{-1}(y)),$$
and, by differentiation, because $g$ is monotonic increasing,
$$f_Y(y) = f_X(g^{-1}(y)) \frac{d}{dt}\{g^{-1}(t)\}_{t=y} = f_X(g^{-1}(y)) \left| \frac{d}{dt}\{g^{-1}(t)\}_{t=y} \right|, \quad \text{as } \frac{d}{dt}\{g^{-1}(t)\} > 0.$$

Case (II): If $g$ is decreasing, then for $x \in \mathbb{X}$ and $y \in \mathbb{Y}$, we have
$$g(x) \leq y \iff x \geq g^{-1}(y).$$
Therefore, for $y \in \mathbb{Y}$,
$$F_Y(y) = P[Y \leq y] = P[g(X) \leq y] = P[X \geq g^{-1}(y)] = 1 - F_X(g^{-1}(y)),$$
so
$$f_Y(y) = -f_X(g^{-1}(y)) \frac{d}{dt}\{g^{-1}(t)\}_{t=y} = f_X(g^{-1}(y)) \left| \frac{d}{dt}\{g^{-1}(t)\}_{t=y} \right|, \quad \text{as } \frac{d}{dt}\{g^{-1}(t)\} < 0.$$

Definition 2.6.1 Suppose the transformation $g : \mathbb{X} \to \mathbb{Y}$ is 1-1, and is defined by $g(x) = y$ for $x \in \mathbb{X}$. Then the Jacobian of the transformation, denoted $J(y)$, is given by
$$J(y) = \frac{d}{dt}\{g^{-1}(t)\}_{t=y},$$
that is, the first derivative of $g^{-1}$ evaluated at $y = g(x)$. Note that the inverse transformation $g^{-1} : \mathbb{Y} \to \mathbb{X}$ has Jacobian $1/J(x)$.

NOTE:
(i) The Jacobian is precisely the same term that appears as a change of variable term in an integration.
(ii) In the Univariate Transformation Theorem, in the continuous case, we take the modulus of the Jacobian.
(iii) To compute the expectation of $Y = g(X)$, we now have two alternative methods of computation: we either compute the expectation of $g(X)$ with respect to the distribution of $X$, or compute the distribution of $Y$, and then its expectation. It is straightforward to demonstrate that the two methods are equivalent, that is,
$$E_{f_X}[g(X)] = E_{f_Y}[Y].$$
This result is sometimes known as the Law of the Unconscious Statistician.

IMPORTANT NOTE: The apparently appealing "plug-in" approach that sets $f_Y(y) = f_X(g^{-1}(y))$ will almost always fail, as the Jacobian term must be included. For example, if $Y = e^X$, so that $X = \log Y$, then merely setting $f_Y(y) = f_X(\log y)$ is insufficient: you must have
$$f_Y(y) = f_X(\log y)\,\frac{1}{y}.$$
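To see the Jacobian factor at work numerically, take X uniform on (0, 1) and Y = e^X. The sketch below (an illustration, using simple midpoint-rule quadrature) checks that f_X(log y)/y integrates to 1 over (1, e), while the plug-in candidate f_X(log y) does not.

```python
import math

def f_X(x):
    # X uniform on (0, 1).
    return 1.0 if 0 < x < 1 else 0.0

def integrate(g, a, b, n=100_000):
    # Midpoint rule.
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

# Correct density of Y = e^X on (1, e): f_Y(y) = f_X(log y) / y.
good = integrate(lambda y: f_X(math.log(y)) / y, 1.0, math.e)
bad  = integrate(lambda y: f_X(math.log(y)), 1.0, math.e)

print(round(good, 6))   # 1.0  (a genuine pdf)
print(round(bad, 6))    # e - 1, about 1.718  (not a pdf)
```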

2.7 GENERATING FUNCTIONS

2.7.1 MOMENT GENERATING FUNCTIONS

Definition 2.7.1 For a random variable $X$ with mass/density function $f_X$, the moment generating function, or mgf, of $X$, $M_X$, is defined by
$$M_X(t) = E_{f_X}[e^{tX}],$$
if this expectation exists for all values of $t \in (-h, h)$ for some $h > 0$; that is,
$$\text{DISCRETE CASE: } M_X(t) = \sum e^{tx} f_X(x), \qquad \text{CONTINUOUS CASE: } M_X(t) = \int e^{tx} f_X(x)\,dx,$$
where the sum/integral is over $\mathbb{X}$.

NOTE: It can be shown that if $X_1$ and $X_2$ are random variables taking values on $\mathbb{X}$ with mass/density functions $f_{X_1}$ and $f_{X_2}$, and mgfs $M_{X_1}$ and $M_{X_2}$ respectively, then
$$f_{X_1}(x) \equiv f_{X_2}(x), \ x \in \mathbb{X} \iff M_{X_1}(t) \equiv M_{X_2}(t), \ t \in (-h, h).$$
Hence there is a 1-1 correspondence between generating functions and distributions: this provides a key technique for identification of probability distributions.

2.7.2 KEY PROPERTIES OF MGFS

(i) If $X$ is a discrete random variable, the $r$th derivative of $M_X$ evaluated at $t$, $M_X^{(r)}(t)$, is given by
$$M_X^{(r)}(t) = \frac{d^r}{ds^r}\{M_X(s)\}_{s=t} = \frac{d^r}{ds^r}\left\{\sum_x e^{sx} f_X(x)\right\}_{s=t} = \sum_x x^r e^{tx} f_X(x),$$
and hence
$$M_X^{(r)}(0) = \sum_x x^r f_X(x) = E_{f_X}[X^r].$$
If $X$ is a continuous random variable, the $r$th derivative of $M_X$ is given by
$$M_X^{(r)}(t) = \frac{d^r}{ds^r}\left\{\int e^{sx} f_X(x)\,dx\right\}_{s=t} = \int x^r e^{tx} f_X(x)\,dx,$$
and hence
$$M_X^{(r)}(0) = \int x^r f_X(x)\,dx = E_{f_X}[X^r].$$

(ii) If $X$ is a discrete random variable, then
$$M_X(t) = \sum_x e^{tx} f_X(x) = \sum_x \left\{\sum_{r=0}^{\infty} \frac{(tx)^r}{r!}\right\} f_X(x) = 1 + \sum_{r=1}^{\infty} \frac{t^r}{r!}\left\{\sum_x x^r f_X(x)\right\} = 1 + \sum_{r=1}^{\infty} \frac{t^r}{r!} E_{f_X}[X^r].$$
The identical result holds for the continuous case.

(iii) From the general result for expectations of functions of random variables, if $Y = aX + b$,
$$M_Y(t) = E_{f_X}[e^{t(aX+b)}] = e^{bt} E_{f_X}[e^{atX}] = e^{bt} M_X(at).$$

Theorem 2.7.1 Let $X_1, \ldots, X_k$ be independent random variables with mgfs $M_{X_1}, \ldots, M_{X_k}$ respectively. Then if the random variable $Y$ is defined by $Y = X_1 + \ldots + X_k$,
$$M_Y(t) = \prod_{i=1}^{k} M_{X_i}(t).$$

Proof. For $k = 2$: if $X_1$ and $X_2$ are independent, integer-valued, discrete r.v.'s, then if $Y = X_1 + X_2$, by the Theorem of Total Probability,
$$f_Y(y) = P[Y = y] = \sum_{x_1} P[Y = y | X_1 = x_1] P[X_1 = x_1] = \sum_{x_1} f_{X_2}(y - x_1) f_{X_1}(x_1).$$
Hence
$$M_Y(t) = E_{f_Y}[e^{tY}] = \sum_y e^{ty} f_Y(y) = \sum_y e^{ty} \left\{\sum_{x_1} f_{X_2}(y - x_1) f_{X_1}(x_1)\right\} = \sum_{x_2} \sum_{x_1} e^{t(x_1 + x_2)} f_{X_2}(x_2) f_{X_1}(x_1)$$
(changing variables in the summation, $x_2 = y - x_1$)
$$= \left\{\sum_{x_1} e^{tx_1} f_{X_1}(x_1)\right\} \left\{\sum_{x_2} e^{tx_2} f_{X_2}(x_2)\right\} = M_{X_1}(t) M_{X_2}(t),$$

and the result follows for general $k$ by recursion. The result for continuous random variables follows in the obvious way.

Special Case: If $X_1, \ldots, X_k$ are identically distributed, then $M_{X_i}(t) \equiv M_X(t)$, say, for all $i$, so
$$M_Y(t) = \prod_{i=1}^{k} M_X(t) = \{M_X(t)\}^k.$$
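For the geometric pmf of Example 2.1 the mgf has the closed form M_X(t) = e^t/(2 - e^t) for t < log 2 (summing the geometric series). A numeric sketch checking property (i) (moments via derivatives at 0, approximated by central differences) and Theorem 2.7.1 (the mgf of X1 + X2 for independent copies, obtained via the convolution pmf, equals the square of the mgf):

```python
from math import exp, log

def M(t):
    # mgf of Example 2.1's pmf f(x) = (1/2)^x, x >= 1; valid for t < log 2.
    assert t < log(2)
    return exp(t) / (2 - exp(t))

# Property (i): numerical derivatives of M at 0 recover the moments.
h = 1e-4
EX  = (M(h) - M(-h)) / (2 * h)               # ~ E[X]   = 2
EX2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2     # ~ E[X^2] = 6
assert abs(EX - 2) < 1e-6 and abs(EX2 - 6) < 1e-4

# Theorem 2.7.1: pmf of Y = X1 + X2 by convolution, then compare
# M_Y(t) with M_X(t)^2 at a test point (truncated sums, tiny tails).
f = {x: 0.5 ** x for x in range(1, 120)}
fY = {}
for x1, p1 in f.items():
    for x2, p2 in f.items():
        fY[x1 + x2] = fY.get(x1 + x2, 0.0) + p1 * p2

t0 = 0.3
MY = sum(exp(t0 * y) * p for y, p in fY.items())
assert abs(MY - M(t0) ** 2) < 1e-9
print(EX, EX2)   # approximately 2 and 6
```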

2.7.3 OTHER GENERATING FUNCTIONS

Definition 2.7.2 For a random variable $X$ with mass/density function $f_X$, the factorial moment generating function or probability generating function, fmgf or pgf, of $X$, $G_X$, is defined by
$$G_X(t) = E_{f_X}[t^X] = E_{f_X}[e^{X \log t}] = M_X(\log t),$$
if this expectation exists for all values of $t \in (1-h, 1+h)$ for some $h > 0$.

Properties:

(i) Using similar techniques to those used for the mgf, it can be shown that
$$G_X^{(r)}(t) = \frac{d^r}{ds^r}\{G_X(s)\}_{s=t} = E_{f_X}\left[X(X-1)\cdots(X-r+1)\, t^{X-r}\right] \implies G_X^{(r)}(1) = E_{f_X}[X(X-1)\cdots(X-r+1)],$$
where $E_{f_X}[X(X-1)\cdots(X-r+1)]$ is the $r$th factorial moment.

(ii) For discrete random variables, it can be shown by using a Taylor series expansion of $G_X$ that, for $r = 1, 2, \ldots$,
$$\frac{G_X^{(r)}(0)}{r!} = P[X = r].$$

Definition 2.7.3 For a random variable $X$ with mass/density function $f_X$, the cumulant generating function of $X$, $K_X$, is defined by
$$K_X(t) = \log[M_X(t)],$$
for $t \in (-h, h)$ for some $h > 0$.

Moment generating functions provide a very useful technique for identifying distributions, but suffer from the disadvantage that the integrals which define them may not always be finite. Another class of functions which are equally useful, and whose finiteness is guaranteed, is described next.

Definition 2.7.4 The characteristic function, or cf, of $X$, $C_X$, is defined by
$$C_X(t) = E_{f_X}\left[e^{itX}\right].$$

By definition,
$$C_X(t) = \int_{\mathbb{X}} e^{itx} f_X(x)\,dx = \int_{\mathbb{X}} [\cos tx + i \sin tx] f_X(x)\,dx = \int_{\mathbb{X}} \cos tx \, f_X(x)\,dx + i \int_{\mathbb{X}} \sin tx \, f_X(x)\,dx = E_{f_X}[\cos tX] + i E_{f_X}[\sin tX].$$

We will be concerned primarily with cases where the moment generating function exists, and the use of moment generating functions will be a key tool for identification of distributions.

2.8 JOINT PROBABILITY DISTRIBUTIONS

Suppose $X$ and $Y$ are random variables on the probability space $(\Omega, \mathcal{A}, P(.))$. Their distribution functions $F_X$ and $F_Y$ contain information about their associated probabilities. But how do we describe information about their properties relative to each other? We think of $X$ and $Y$ as components of a random vector $(X, Y)$ taking values in $\mathbb{R}^2$, rather than as unrelated random variables each taking values in $\mathbb{R}$.

Example 2.5 Toss a coin $n$ times and let $X_i = 0$ or $1$, depending on whether the $i$th toss is a tail or a head. The random vector $X = (X_1, \ldots, X_n)$ describes the whole experiment. The total number of heads is $\sum_{i=1}^{n} X_i$.

The joint distribution function of a random vector $(X_1, \ldots, X_n)$ is $P(X_1 \leq x_1, \ldots, X_n \leq x_n)$, a function of $n$ real variables $x_1, \ldots, x_n$. For vectors $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$, write $\mathbf{x} \leq \mathbf{y}$ if $x_i \leq y_i$ for each $i = 1, \ldots, n$.

Definition 2.8.1 The joint distribution function of a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ on $(\Omega, \mathcal{A}, P(.))$ is given by $F_{\mathbf{X}} : \mathbb{R}^n \to [0, 1]$, defined by
$$F_{\mathbf{X}}(\mathbf{x}) = P(\mathbf{X} \leq \mathbf{x}), \quad \mathbf{x} \in \mathbb{R}^n.$$
[Remember, formally, $\{\mathbf{X} \leq \mathbf{x}\}$ means $\{\omega \in \Omega : \mathbf{X}(\omega) \leq \mathbf{x}\}$.]

We will consider, for simplicity, the case $n = 2$, without any loss of generality: the case $n > 2$ is just notationally more cumbersome.

Properties of the joint distribution function. The joint distribution function $F_{X,Y}$ of the random vector $(X, Y)$ satisfies:

(i) $\displaystyle\lim_{x,y \to -\infty} F_{X,Y}(x, y) = 0, \qquad \lim_{x,y \to \infty} F_{X,Y}(x, y) = 1.$

(ii) If $(x_1, y_1) \leq (x_2, y_2)$ then $F_{X,Y}(x_1, y_1) \leq F_{X,Y}(x_2, y_2)$.

(iii) $F_{X,Y}$ is continuous from above:
$$F_{X,Y}(x + u, y + v) \to F_{X,Y}(x, y) \quad \text{as } u, v \to 0^+.$$

(iv) $\displaystyle\lim_{y \to \infty} F_{X,Y}(x, y) = F_X(x) \equiv P(X \leq x), \qquad \lim_{x \to \infty} F_{X,Y}(x, y) = F_Y(y) \equiv P(Y \leq y).$

$F_X$ and $F_Y$ are the marginal distribution functions of the joint distribution $F_{X,Y}$.

Definition 2.8.2 The random variables $X$ and $Y$ on $(\Omega, \mathcal{A}, P(.))$ are (jointly) discrete if $(X, Y)$ takes values in a countable subset of $\mathbb{R}^2$ only.

Definition 2.8.3 Discrete variables $X$ and $Y$ are independent if the events $\{X = x\}$ and $\{Y = y\}$ are independent for all $x$ and $y$.

Definition 2.8.4 The joint probability mass function $f_{X,Y} : \mathbb{R}^2 \to [0, 1]$ of $X$ and $Y$ is given by
$$f_{X,Y}(x, y) = P(X = x, Y = y).$$
The marginal pmf of $X$, $f_X(x)$, is found from
$$f_X(x) = P(X = x) = \sum_y P(X = x, Y = y) = \sum_y f_{X,Y}(x, y),$$
and similarly for $f_Y(y)$.

The definition of independence can be reformulated as: $X$ and $Y$ are independent iff
$$f_{X,Y}(x, y) = f_X(x) f_Y(y), \quad \text{for all } x, y \in \mathbb{R}.$$
More generally, $X$ and $Y$ are independent iff $f_{X,Y}(x, y)$ can be factorized as the product $g(x)h(y)$ of a function of $x$ alone and a function of $y$ alone.

Let $\mathbb{X}$ be the support of $X$ and $\mathbb{Y}$ be the support of $Y$. Then $Z = (X, Y)$ has support $\mathbb{Z} = \{(x, y) : f_{X,Y}(x, y) > 0\}$. In nice cases $\mathbb{Z} = \mathbb{X} \times \mathbb{Y}$, but we need to be alert to cases with $\mathbb{Z} \neq \mathbb{X} \times \mathbb{Y}$. In general, given a random vector $(X_1, \ldots, X_k)$, we will denote its range or support by $\mathbb{X}^{(k)}$.
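A small worked joint pmf, computing marginals by summation and testing the factorization criterion for independence; the table of probabilities is an arbitrary illustration.

```python
from fractions import Fraction as Fr

# Joint pmf of (X, Y) on {0,1} x {0,1,2}, built to factorize:
# f(x, y) = g(x) h(y) with g = (1/4, 3/4), h = (1/2, 1/3, 1/6).
g = {0: Fr(1, 4), 1: Fr(3, 4)}
h = {0: Fr(1, 2), 1: Fr(1, 3), 2: Fr(1, 6)}
f = {(x, y): g[x] * h[y] for x in g for y in h}
assert sum(f.values()) == 1

# Marginal pmfs by summing the joint over the other variable.
fX = {x: sum(f[x, y] for y in h) for x in g}
fY = {y: sum(f[x, y] for x in g) for y in h}

# Independence: the joint factorizes into the product of marginals.
assert all(f[x, y] == fX[x] * fY[y] for x in g for y in h)

# Perturbing one cell (compensating in another) destroys independence.
f2 = dict(f)
eps = Fr(1, 24)
f2[0, 0] += eps; f2[1, 1] -= eps
fX2 = {x: sum(f2[x, y] for y in h) for x in g}
fY2 = {y: sum(f2[x, y] for x in g) for y in h}
assert any(f2[x, y] != fX2[x] * fY2[y] for x in g for y in h)
print("marginalization and independence checks passed")
```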

Definition 2.8.5 The conditional distribution function of $Y$ given $X = x$, $F_{Y|X}(y|x)$, is defined by
$$F_{Y|X}(y|x) = P(Y \leq y | X = x),$$
for any $x$ such that $P(X = x) > 0$. The conditional probability mass function of $Y$ given $X = x$, $f_{Y|X}(y|x)$, is defined by
$$f_{Y|X}(y|x) = P(Y = y | X = x),$$
for any $x$ such that $P(X = x) > 0$.

Turning now to the continuous case, we define:

Definition 2.8.6 The random variables $X$ and $Y$ on $(\Omega, \mathcal{A}, P(.))$ are called jointly continuous if their joint distribution function can be expressed as
$$F_{X,Y}(x, y) = \int_{u=-\infty}^{x} \int_{v=-\infty}^{y} f_{X,Y}(u, v)\,dv\,du, \quad x, y \in \mathbb{R},$$
for some $f_{X,Y} : \mathbb{R}^2 \to [0, \infty)$. Then $f_{X,Y}$ is the joint probability density function of $X, Y$.

If $F_{X,Y}$ is sufficiently differentiable at $(x, y)$, we have
$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x, y).$$
This is the usual case, which we will assume from now on. Then:

(i)
$$P(a \leq X \leq b, \, c \leq Y \leq d) = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c) = \int_{c}^{d} \int_{a}^{b} f_{X,Y}(x, y)\,dx\,dy.$$
If $B$ is a nice subset of $\mathbb{R}^2$, such as a union of rectangles,
$$P((X, Y) \in B) = \iint_{B} f_{X,Y}(x, y)\,dx\,dy.$$

(ii) The marginal distribution functions of $X$ and $Y$ are
$$F_X(x) = P(X \leq x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = P(Y \leq y) = F_{X,Y}(\infty, y).$$

Since
$$F_X(x) = \int_{-\infty}^{x} \left\{\int_{-\infty}^{\infty} f_{X,Y}(u, y)\,dy\right\} du,$$
we see, differentiating with respect to $x$, that the marginal pdf of $X$ is
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy.$$
Similarly, the marginal pdf of $Y$ is
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx.$$

We cannot, as we did in the discrete case, define independence of $X$ and $Y$ in terms of events $\{X = x\}$ and $\{Y = y\}$, as these have zero probability and are trivially independent. So:

Definition 2.8.7 $X$ and $Y$ are independent if $\{X \leq x\}$ and $\{Y \leq y\}$ are independent events, for all $x, y \in \mathbb{R}$.

So, $X$ and $Y$ are independent iff
$$F_{X,Y}(x, y) = F_X(x) F_Y(y), \quad x, y \in \mathbb{R},$$
or (equivalently) iff
$$f_{X,Y}(x, y) = f_X(x) f_Y(y),$$
whenever $F_{X,Y}$ is differentiable at $(x, y)$.

Definition 2.8.8 The conditional distribution function of $Y$ given $X = x$, $F_{Y|X}(y|x)$ or $P(Y \leq y | X = x)$, is defined as
$$F_{Y|X}(y|x) = \int_{v=-\infty}^{y} \frac{f_{X,Y}(x, v)}{f_X(x)}\,dv,$$
for any $x$ such that $f_X(x) > 0$.

Definition 2.8.9 The conditional density function of $Y$, given $X = x$, $f_{Y|X}(y|x)$, is defined by
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)},$$
for any $x$ such that $f_X(x) > 0$.

This is an appropriate point to remark that not all random variables are either continuous or discrete, and not all distribution functions are either absolutely continuous or discrete. Many practical examples exist of distribution functions that are partly discrete and partly continuous.

Example 2.6 We record the delay that a motorist encounters at a one-way traffic stop sign. Let $X$ be the random variable representing the delay the motorist experiences. There is a certain probability that there will be no opposing traffic, so she will be able to proceed without delay. However, if she has to wait, she could (in principle) have to wait for any positive amount of time. The experiment could be described by assuming that $X$ has distribution function
$$F_X(x) = (1 - p e^{-\lambda x}) I_{[0,\infty)}(x).$$
This has a jump of $1 - p$ at $x = 0$, but is continuous for $x > 0$: there is a probability $1 - p$ of no wait at all.

We shall see later cases of random vectors, $(X, Y)$ say, where one component is discrete and the other continuous: there is no essential complication in the manipulation of the marginal distributions etc. for such a case.

2.8.1 THE CHAIN RULE FOR RANDOM VARIABLES

As with the chain rule for manipulation of probabilities, there is an explicit relationship between joint, marginal, and conditional mass/density functions. For example, consider three continuous random variables $X_1, X_2, X_3$ with joint pdf $f_{X_1,X_2,X_3}$. Then
$$f_{X_1,X_2,X_3}(x_1, x_2, x_3) = f_{X_1}(x_1)\, f_{X_2|X_1}(x_2|x_1)\, f_{X_3|X_1,X_2}(x_3|x_1, x_2),$$
so that, for example,
$$f_{X_1}(x_1) = \int_{\mathbb{X}_2} \int_{\mathbb{X}_3} f_{X_1,X_2,X_3}(x_1, x_2, x_3)\,dx_2\,dx_3 = \int_{\mathbb{X}_2} \int_{\mathbb{X}_3} f_{X_1|X_2,X_3}(x_1|x_2, x_3)\, f_{X_2,X_3}(x_2, x_3)\,dx_2\,dx_3 = \int_{\mathbb{X}_2} \int_{\mathbb{X}_3} f_{X_1|X_2,X_3}(x_1|x_2, x_3)\, f_{X_2|X_3}(x_2|x_3)\, f_{X_3}(x_3)\,dx_2\,dx_3.$$

Equivalent relationships hold in the discrete case, and can be extended to determine the explicit relationship between joint, marginal, and conditional mass/density functions for any number of random variables.

NOTE: the discrete equivalent of this result is a DIRECT consequence of the Theorem of Total Probability; the event $[X_1 = x_1]$ is partitioned into sub-events $[(X_1 = x_1) \cap (X_2 = x_2) \cap (X_3 = x_3)]$ for all possible values of the pair $(x_2, x_3)$.

2.8.2 CONDITIONAL EXPECTATION AND ITERATED EXPECTATION

Consider two discrete/continuous random variables $X_1$ and $X_2$ with joint mass function/pdf $f_{X_1,X_2}$, and the conditional mass function/pdf of $X_1$ given $X_2 = x_2$, defined in the usual way by
$$f_{X_1|X_2}(x_1|x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}.$$

Then the conditional expectation of $g(X_1)$ given $X_2 = x_2$ is defined by
$$E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2] = \begin{cases} \displaystyle\sum_{x_1 \in \mathbb{X}_1} g(x_1) f_{X_1|X_2}(x_1|x_2), & X_1 \text{ DISCRETE}, \\ \displaystyle\int_{\mathbb{X}_1} g(x_1) f_{X_1|X_2}(x_1|x_2)\,dx_1, & X_1 \text{ CONTINUOUS}, \end{cases}$$
i.e. the expectation of $g(X_1)$ with respect to the conditional density of $X_1$ given $X_2 = x_2$ (possibly giving a function of $x_2$). The case $g(x) \equiv x$ is a particular case.

Theorem 2.8.1 THE LAW OF ITERATED EXPECTATION For two continuous random variables $X_1$ and $X_2$ with joint pdf $f_{X_1,X_2}$,
$$E_{f_{X_1}}[g(X_1)] = E_{f_{X_2}}\left[E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2]\right].$$

Proof.
$$E_{f_{X_1}}[g(X_1)] = \int_{\mathbb{X}_1} g(x_1) f_{X_1}(x_1)\,dx_1 = \int_{\mathbb{X}_1} g(x_1)\left\{\int_{\mathbb{X}_2} f_{X_1,X_2}(x_1, x_2)\,dx_2\right\} dx_1 = \int_{\mathbb{X}_1} g(x_1)\left\{\int_{\mathbb{X}_2} f_{X_1|X_2}(x_1|x_2) f_{X_2}(x_2)\,dx_2\right\} dx_1$$
$$= \int_{\mathbb{X}_2}\left\{\int_{\mathbb{X}_1} g(x_1) f_{X_1|X_2}(x_1|x_2)\,dx_1\right\} f_{X_2}(x_2)\,dx_2 = \int_{\mathbb{X}_2}\left\{E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2]\right\} f_{X_2}(x_2)\,dx_2 = E_{f_{X_2}}\left[E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2]\right],$$
so the expectation of $g(X_1)$ can be calculated by finding the conditional expectation of $g(X_1)$ given $X_2 = x_2$, giving a function of $x_2$, and then taking the expectation of this function with respect to the marginal density for $X_2$. Note that this proof only works if the conditional expectation and the marginal expectation are finite. This result extends naturally to $k$ variables.
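A check of the Law of Iterated Expectation on a small discrete joint pmf (the same logic as the continuous proof, with sums in place of integrals); the pmf values are arbitrary.

```python
from fractions import Fraction as Fr

# Arbitrary joint pmf on {1,2} x {1,2,3}; entries sum to 1.
f = {(1, 1): Fr(1, 12), (1, 2): Fr(2, 12), (1, 3): Fr(3, 12),
     (2, 1): Fr(2, 12), (2, 2): Fr(1, 12), (2, 3): Fr(3, 12)}

X1s, X2s = {1, 2}, {1, 2, 3}
f2 = {x2: sum(f[x1, x2] for x1 in X1s) for x2 in X2s}  # marginal of X2

g = lambda x: x * x   # any function of X1

# Direct expectation E[g(X1)] from the marginal of X1.
direct = sum(g(x1) * sum(f[x1, x2] for x2 in X2s) for x1 in X1s)

# Iterated: E over X2 of the conditional expectation E[g(X1)|X2 = x2].
def cond_exp(x2):
    return sum(g(x1) * f[x1, x2] / f2[x2] for x1 in X1s)

iterated = sum(cond_exp(x2) * f2[x2] for x2 in X2s)
assert direct == iterated
print(direct)   # 5/2
```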

2.9 MULTIVARIATE TRANSFORMATIONS

Theorem 2.9.1 THE MULTIVARIATE TRANSFORMATION THEOREM
Let $\mathbf{X} = (X_1,\dots,X_k)$ be a vector of random variables, with joint mass/density function $f_{X_1,\dots,X_k}$. Let $\mathbf{Y} = (Y_1,\dots,Y_k)$ be a vector of random variables defined by $Y_i = g_i(X_1,\dots,X_k)$ for some functions $g_i$, $i = 1,\dots,k$, where the vector function $g$ mapping $(X_1,\dots,X_k)$ to $(Y_1,\dots,Y_k)$ is a 1-1 transformation. Then the joint mass/density function of $(Y_1,\dots,Y_k)$ is given by

DISCRETE: $f_{Y_1,\dots,Y_k}(y_1,\dots,y_k) = f_{X_1,\dots,X_k}(x_1,\dots,x_k)$,

CONTINUOUS: $f_{Y_1,\dots,Y_k}(y_1,\dots,y_k) = f_{X_1,\dots,X_k}(x_1,\dots,x_k)\,|J(y_1,\dots,y_k)|$,

where $\mathbf{x} = (x_1,\dots,x_k)$ is the unique solution of the system $\mathbf{y} = g(\mathbf{x})$, so that $\mathbf{x} = g^{-1}(\mathbf{y})$, and where $J(y_1,\dots,y_k)$ is the Jacobian of the transformation, that is, the determinant of the $k \times k$ matrix whose $(i,j)$th element is
$$\frac{\partial}{\partial t_j}\big\{ g_i^{-1}(\mathbf{t}) \big\}\Big|_{t_1=y_1,\dots,t_k=y_k},$$
where $g_i^{-1}$ is the inverse function uniquely defined by $X_i = g_i^{-1}(Y_1,\dots,Y_k)$. Note again the modulus.

Proof. The discrete case proof follows the univariate case precisely. For the continuous case, consider the equivalent events $[\mathbf{X} \in C]$ and $[\mathbf{Y} \in D]$, where $D$ is the image of $C$ under $g$. Clearly, $P[\mathbf{X} \in C] = P[\mathbf{Y} \in D]$. Now, $P[\mathbf{X} \in C]$ is the $k$-dimensional integral of the joint density $f_{X_1,\dots,X_k}$ over the set $C$, and $P[\mathbf{Y} \in D]$ is the $k$-dimensional integral of the joint density $f_{Y_1,\dots,Y_k}$ over the set $D$. The result follows by changing variables in the first integral from $\mathbf{x}$ to $\mathbf{y} = g(\mathbf{x})$, and equating the two integrands.

Note: As for single variable transformations, the ranges of the transformed variables must be considered carefully.

Example 2.7 The multivariate transformation theorem provides a simple proof of the convolution formula: if $X$ and $Y$ are independent continuous random variables with pdfs $f_X(x)$ and $f_Y(y)$, then the pdf of $Z = X + Y$ is
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(w) f_Y(z-w)\,dw.$$
Let $W = X$. The Jacobian of the transformation from $(X,Y)$ to $(Z,W)$ is 1. So, the joint pdf of $(Z,W)$ is $f_{Z,W}(z,w) = f_{X,Y}(w, z-w) = f_X(w) f_Y(z-w)$. Then integrate out $W$ to obtain the marginal pdf of $Z$.

Example 2.8 Consider the case $k = 2$, and suppose that $X_1$ and $X_2$ are independent continuous random variables with ranges $\mathbb{X}_1 = \mathbb{X}_2 = [0,1]$ and pdfs given respectively by
$$f_{X_1}(x_1) = 6x_1(1-x_1), \quad 0 \le x_1 \le 1, \qquad f_{X_2}(x_2) = 3x_2^2, \quad 0 \le x_2 \le 1,$$
and zero elsewhere. In order to calculate the pdf of the random variable $Y_1$ defined by $Y_1 = X_1X_2$,

using the transformation result, consider the additional random variable $Y_2$, where $Y_2 = X_1$ (note, as $X_1$ and $X_2$ take values on $[0,1]$, $X_1X_2 \le X_1$, so $Y_1 \le Y_2$). The transformation $\mathbf{Y} = g(\mathbf{X})$ is then specified by the two functions
$$g_1(t_1,t_2) = t_1t_2, \qquad g_2(t_1,t_2) = t_1,$$
and the inverse transformation $\mathbf{X} = g^{-1}(\mathbf{Y})$ (i.e. $\mathbf{X}$ in terms of $\mathbf{Y}$) is $X_1 = Y_2$, $X_2 = Y_1/Y_2$, giving
$$g_1^{-1}(t_1,t_2) = t_2, \qquad g_2^{-1}(t_1,t_2) = t_1/t_2.$$
Hence
$$\frac{\partial}{\partial t_1}\{g_1^{-1}(t_1,t_2)\} = 0, \quad \frac{\partial}{\partial t_2}\{g_1^{-1}(t_1,t_2)\} = 1, \quad \frac{\partial}{\partial t_1}\{g_2^{-1}(t_1,t_2)\} = 1/t_2, \quad \frac{\partial}{\partial t_2}\{g_2^{-1}(t_1,t_2)\} = -t_1/t_2^2,$$
and so the Jacobian $J(y_1,y_2)$ of the transformation is given by
$$J(y_1,y_2) = \begin{vmatrix} 0 & 1 \\ 1/y_2 & -y_1/y_2^2 \end{vmatrix} = -\frac{1}{y_2},$$
so that $|J(y_1,y_2)| = 1/y_2$. Hence, using the theorem,
$$f_{Y_1,Y_2}(y_1,y_2) = f_{X_1,X_2}(y_2, y_1/y_2)\,|J(y_1,y_2)| = 6y_2(1-y_2)\cdot 3(y_1/y_2)^2 \cdot \frac{1}{y_2} = 18y_1^2(1-y_2)/y_2^2,$$
on the set $\mathbb{Y}^{(2)} = \{(y_1,y_2) : 0 \le y_1 \le y_2 \le 1\}$, and zero otherwise. Hence
$$f_{Y_1}(y_1) = \int_{y_1}^{1} \frac{18y_1^2(1-y_2)}{y_2^2}\,dy_2 = 18y_1^2\Big[-\frac{1}{y_2} - \log y_2\Big]_{y_1}^{1} = 18y_1^2\Big(-1 + \frac{1}{y_1} + \log y_1\Big) = 18y_1(1 - y_1 + y_1\log y_1),$$
for $0 \le y_1 \le 1$, and zero otherwise.
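The derived density can be verified by simulation; the sketch below (my own illustration, not from the notes) uses the fact that the stated pdfs are those of Beta(2,2) and Beta(3,1) random variables:

```python
import numpy as np

# Simulation check of Example 2.8: X1 has pdf 6x(1-x) (a Beta(2,2)), X2 has
# pdf 3x^2 (a Beta(3,1)); the pdf of Y1 = X1*X2 should be
# f(y) = 18y(1 - y + y*log y) on (0,1).
rng = np.random.default_rng(1)
n = 1_000_000

y1 = rng.beta(2, 2, size=n) * rng.beta(3, 1, size=n)

edges = np.linspace(0.05, 0.95, 10)
hist, _ = np.histogram(y1, bins=edges, density=True)   # empirical density
mid = 0.5 * (edges[:-1] + edges[1:])
theory = 18 * mid * (1 - mid + mid * np.log(mid))      # derived density
print(np.round(hist, 3))
print(np.round(theory, 3))   # the two rows should roughly agree
```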

2.10 MULTIVARIATE EXPECTATIONS AND COVARIANCE

2.10.1 EXPECTATION WITH RESPECT TO JOINT DISTRIBUTIONS

Definition 2.10.1 For random variables $X_1,\dots,X_k$ with range $\mathbb{X}^{(k)}$ and mass/density function $f_{X_1,\dots,X_k}$, the expectation of $g(X_1,\dots,X_k)$ is defined in the discrete and continuous cases by
$$E_{f_{X_1,\dots,X_k}}[g(X_1,\dots,X_k)] = \begin{cases} \displaystyle\sum_{x_1\in\mathbb{X}_1}\cdots\sum_{x_k\in\mathbb{X}_k} g(x_1,\dots,x_k)\, f_{X_1,\dots,X_k}(x_1,\dots,x_k), \\[2mm] \displaystyle\int_{\mathbb{X}_1}\cdots\int_{\mathbb{X}_k} g(x_1,\dots,x_k)\, f_{X_1,\dots,X_k}(x_1,\dots,x_k)\,dx_1\dots dx_k. \end{cases}$$

PROPERTIES

(i) Let $g$ and $h$ be real-valued functions and let $a$ and $b$ be constants. Then, writing $f_{\mathbf{X}} \equiv f_{X_1,\dots,X_k}$,
$$E_{f_{\mathbf{X}}}[ag(X_1,\dots,X_k) + bh(X_1,\dots,X_k)] = aE_{f_{\mathbf{X}}}[g(X_1,\dots,X_k)] + bE_{f_{\mathbf{X}}}[h(X_1,\dots,X_k)].$$

(ii) Let $X_1,\dots,X_k$ be independent random variables with mass functions/pdfs $f_{X_1},\dots,f_{X_k}$ respectively. Let $g_1,\dots,g_k$ be scalar functions of $X_1,\dots,X_k$ respectively (that is, $g_i$ is a function of $X_i$ only, for $i = 1,\dots,k$). If $g(X_1,\dots,X_k) = g_1(X_1)\cdots g_k(X_k)$, then
$$E_{f_{\mathbf{X}}}[g(X_1,\dots,X_k)] = \prod_{i=1}^k E_{f_{X_i}}[g_i(X_i)],$$
where $E_{f_{X_i}}[g_i(X_i)]$ is the marginal expectation of $g_i(X_i)$ with respect to $f_{X_i}$.

(iii) Generally, $E_{f_{\mathbf{X}}}[g(X_1)] \equiv E_{f_{X_1}}[g(X_1)]$, so that the expectation over the joint distribution is the same as the expectation over the marginal distribution. The proof is an immediate consequence of the fact that the marginal pdf $f_{X_1}$ is obtained by integrating the joint density with respect to $x_2,\dots,x_k$. So, whenever we wish, it is reasonable to denote the expectation as, say, $E[g(X_1)]$, rather than $E_{f_{X_1}}[g(X_1)]$ or $E_{f_{\mathbf{X}}}[g(X_1)]$: we can drop subscripts.

2.10.2 COVARIANCE AND CORRELATION

Definition 2.10.2 The covariance of two random variables $X_1$ and $X_2$ is denoted $\mathrm{Cov}_{f_{X_1,X_2}}[X_1,X_2]$, and is defined by
$$\mathrm{Cov}_{f_{X_1,X_2}}[X_1,X_2] = E_{f_{X_1,X_2}}[(X_1-\mu_1)(X_2-\mu_2)] = E_{f_{X_1,X_2}}[X_1X_2] - \mu_1\mu_2,$$
where $\mu_i = E_{f_{X_i}}[X_i]$ is the marginal expectation of $X_i$, for $i = 1,2$, and where
$$E_{f_{X_1,X_2}}[X_1X_2] = \int\!\!\int x_1x_2\, f_{X_1,X_2}(x_1,x_2)\,dx_1\,dx_2,$$
that is, the expectation of the function $g(x_1,x_2) = x_1x_2$ with respect to the joint distribution $f_{X_1,X_2}$.

Definition 2.10.3 The correlation of $X_1$ and $X_2$ is denoted $\mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2]$, and is defined by
$$\mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2] = \frac{\mathrm{Cov}_{f_{X_1,X_2}}[X_1,X_2]}{\sqrt{\mathrm{Var}_{f_{X_1}}[X_1]\,\mathrm{Var}_{f_{X_2}}[X_2]}}.$$
If $\mathrm{Cov}_{f_{X_1,X_2}}[X_1,X_2] = \mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2] = 0$ then the variables $X_1$ and $X_2$ are uncorrelated. Note that if random variables $X_1$ and $X_2$ are independent then
$$\mathrm{Cov}_{f_{X_1,X_2}}[X_1,X_2] = E_{f_{X_1,X_2}}[X_1X_2] - E_{f_{X_1}}[X_1]E_{f_{X_2}}[X_2] = E_{f_{X_1}}[X_1]E_{f_{X_2}}[X_2] - E_{f_{X_1}}[X_1]E_{f_{X_2}}[X_2] = 0,$$
and so $X_1$ and $X_2$ are also uncorrelated (the converse does not hold).

NOTES:

(i) For random variables $X_1$ and $X_2$, with (marginal) expectations $\mu_1$ and $\mu_2$ respectively, and (marginal) variances $\sigma_1^2$ and $\sigma_2^2$ respectively, if random variables $Z_1$ and $Z_2$ are defined by $Z_1 = (X_1-\mu_1)/\sigma_1$, $Z_2 = (X_2-\mu_2)/\sigma_2$, then $Z_1$ and $Z_2$ are standardized variables. Then $E_{f_{Z_i}}[Z_i] = 0$, $\mathrm{Var}_{f_{Z_i}}[Z_i] = 1$ and $\mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2] = \mathrm{Cov}_{f_{Z_1,Z_2}}[Z_1,Z_2]$.

(ii) Extension to $k$ variables: covariances can only be calculated for pairs of random variables, but if $k$ variables have a joint probability structure it is possible to construct a $k \times k$ matrix, $C$ say, of covariance values, whose $(i,j)$th element is $\mathrm{Cov}_{f_{X_i,X_j}}[X_i,X_j]$, for $i,j = 1,\dots,k$, that captures the complete covariance structure in the joint distribution. If $i \ne j$, then $\mathrm{Cov}_{f_{X_j,X_i}}[X_j,X_i] = \mathrm{Cov}_{f_{X_i,X_j}}[X_i,X_j]$, so $C$ is symmetric, and if $i = j$, $\mathrm{Cov}_{f_{X_i,X_i}}[X_i,X_i] \equiv \mathrm{Var}_{f_{X_i}}[X_i]$. The matrix $C$ is referred to as the variance-covariance matrix.

(iii) If the random variable $X$ is defined by $X = a_1X_1 + a_2X_2 + \cdots + a_kX_k$, for random variables $X_1,\dots,X_k$ and constants $a_1,\dots,a_k$, then
$$E_{f_X}[X] = \sum_{i=1}^k a_iE_{f_{X_i}}[X_i], \qquad \mathrm{Var}_{f_X}[X] = \sum_{i=1}^k a_i^2\,\mathrm{Var}_{f_{X_i}}[X_i] + 2\sum_{i=2}^{k}\sum_{j=1}^{i-1} a_ia_j\,\mathrm{Cov}_{f_{X_i,X_j}}[X_i,X_j] = \mathbf{a}^TC\mathbf{a}, \qquad \mathbf{a} = (a_1,\dots,a_k)^T.$$

(iv) Combining (i) and (iii) when $k = 2$, and defining standardized variables $Z_1$ and $Z_2$,
$$0 \le \mathrm{Var}_{f_{Z_1,Z_2}}[Z_1 \pm Z_2] = \mathrm{Var}_{f_{Z_1}}[Z_1] + \mathrm{Var}_{f_{Z_2}}[Z_2] \pm 2\mathrm{Cov}_{f_{Z_1,Z_2}}[Z_1,Z_2] = 1 + 1 \pm 2\mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2] = 2(1 \pm \mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2]),$$
and hence $-1 \le \mathrm{Corr}_{f_{X_1,X_2}}[X_1,X_2] \le 1$.

2.10.3 JOINT MOMENT GENERATING FUNCTION

Definition 2.10.4 Let $X$ and $Y$ be jointly distributed. The joint moment generating function of $X$ and $Y$ is
$$M_{X,Y}(s,t) = E(e^{sX+tY}).$$
If this exists in a neighbourhood of the origin $(0,0)$, then it has the same attractive properties as the ordinary moment generating function. It determines the joint distribution of $X$ and $Y$ uniquely, and it also yields the moments:
$$\left.\frac{\partial^{m+n}}{\partial s^m\,\partial t^n} M_{X,Y}(s,t)\right|_{s=t=0} = E(X^mY^n).$$
Joint moment generating functions factorize for independent random variables. We have $M_{X,Y}(s,t) = M_X(s)M_Y(t)$ if and only if $X$ and $Y$ are independent. Note, $M_X(s) = M_{X,Y}(s,0)$, etc.

The definition of the joint moment generating function extends in an obvious way to three or more random variables, with the corresponding result for independence. For instance, the random variable $X$ is independent of $(Y,Z)$ if and only if the joint moment generating function $M_{X,Y,Z}(s,t,u) = E(e^{sX+tY+uZ})$ factorizes as $M_{X,Y,Z}(s,t,u) = M_X(s)M_{Y,Z}(t,u)$.

2.10.4 FURTHER RESULTS ON INDEPENDENCE

The following results are useful:

(I) Let $X$ and $Y$ be independent random variables. Let $g(x)$ be a function only of $x$ and $h(y)$ be a function only of $y$. Then the random variables $U = g(X)$ and $V = h(Y)$ are independent.

(II) Let $X_1,\dots,X_n$ be independent random vectors. Let $g_i(x_i)$ be a function only of $x_i$, $i = 1,\dots,n$. Then the random variables $U_i = g_i(X_i)$, $i = 1,\dots,n$, are independent.

2.11 ORDER STATISTICS

Order statistics, like sample moments, play a very important role in statistical inference. Let $X_1,\dots,X_n$ be independent, identically distributed continuous random variables, with cdf $F_X$ and pdf $f_X$. Then order the $X_i$: let $Y_1$ be the smallest of $\{X_1,\dots,X_n\}$, $Y_2$ be the second smallest of $\{X_1,\dots,X_n\}$, ..., $Y_n$ be the largest of $\{X_1,\dots,X_n\}$. Note that since we assume continuity, the chance of ties is zero. It is customary to use the notation $Y_k = X_{(k)}$, and then $X_{(1)}, X_{(2)},\dots,X_{(n)}$ are known as the order statistics of $X_1,\dots,X_n$. Two key results are:

Result A The order statistics have joint density
$$n!\prod_{i=1}^n f_X(y_i), \qquad y_1 < y_2 < \cdots < y_n.$$

Result B The order statistic $X_{(k)}$ has density
$$f_{(k)}(y) = k\binom{n}{k} f_X(y)\,\{1 - F_X(y)\}^{n-k}\{F_X(y)\}^{k-1}.$$

An informal proof of Result A is straightforward, using a symmetry argument based on the independent, identically distributed nature of $X_1,\dots,X_n$. Result B follows on noting that the event $X_{(k)} \le y$ occurs if and only if at least $k$ of the $X_i$ lie in $(-\infty, y]$. Recalling the Binomial distribution, this means that $X_{(k)}$ has distribution function
$$F_{(k)}(y) = \sum_{j=k}^n \binom{n}{j}\{F_X(y)\}^j\{1-F_X(y)\}^{n-j}.$$
The pdf follows on differentiating this cdf.
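For a Uniform(0,1) sample, Result B reduces to a Beta(k, n-k+1) density, which gives an easy simulation check. The sketch below is my own illustration, not part of the notes:

```python
import numpy as np
from math import comb

# Simulation check of Result B for Uniform(0,1): the kth order statistic
# has density k * C(n,k) * y^(k-1) * (1-y)^(n-k), i.e. Beta(k, n-k+1).
rng = np.random.default_rng(3)
n, k, reps = 5, 2, 200_000

samples = rng.uniform(size=(reps, n))
xk = np.sort(samples, axis=1)[:, k - 1]        # kth smallest in each sample
print(xk.mean(), k / (n + 1))                  # Beta(k, n-k+1) mean = k/(n+1)

y = 0.3
dens = k * comb(n, k) * y**(k - 1) * (1 - y)**(n - k)
emp = np.mean(np.abs(xk - y) < 0.01) / 0.02    # crude density estimate at y
print(emp, dens)
```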

CHAPTER 3 DISCRETE PROBABILITY DISTRIBUTIONS

Definition 3.1.1 DISCRETE UNIFORM DISTRIBUTION $X \sim Uniform(n)$
$$f_X(x) = \frac{1}{n}, \qquad x \in \mathbb{X} = \{1,2,\dots,n\},$$
and zero otherwise.

Definition 3.1.2 BERNOULLI DISTRIBUTION $X \sim Bernoulli(\theta)$
$$f_X(x) = \theta^x(1-\theta)^{1-x}, \qquad x \in \mathbb{X} = \{0,1\},$$
and zero otherwise.

NOTE The Bernoulli distribution is used for modelling when the outcome of an experiment is either a success or a failure, where the probability of getting a success is equal to $\theta$. Such an experiment is a Bernoulli trial. The mgf is $M_X(t) = (1-\theta) + \theta e^t$.

Definition 3.1.3 BINOMIAL DISTRIBUTION $X \sim Bin(n,\theta)$
$$f_X(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad x \in \mathbb{X} = \{0,1,2,\dots,n\}, \quad n \ge 1, \quad 0 \le \theta \le 1.$$

NOTES

1. If $X_1,\dots,X_k$ are independent and identically distributed (IID) $Bernoulli(\theta)$ random variables, and $Y = X_1 + \cdots + X_k$, then by the standard result for mgfs,
$$M_Y(t) = \{M_X(t)\}^k = (1 - \theta + \theta e^t)^k,$$
so $Y \sim Bin(k,\theta)$ because of the uniqueness of mgfs. Thus the binomial distribution is used to model the total number of successes in a series of independent and identical experiments.

2. Alternatively, consider sampling without replacement from an infinite collection, or sampling with replacement from a finite collection, of objects, a proportion $\theta$ of which are of Type I, and the remainder of which are of Type II. If $X$ is the number of Type I objects in a sample of $n$, then $X \sim Bin(n,\theta)$.

Definition 3.1.4 POISSON DISTRIBUTION $X \sim Poisson(\lambda)$
$$f_X(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad x \in \mathbb{X} = \{0,1,2,\dots\}, \quad \lambda > 0,$$
and zero otherwise.

NOTES

1. If $X \sim Bin(n,\theta)$, let $\lambda = n\theta$. Then
$$M_X(t) = (1-\theta+\theta e^t)^n = \left(1 + \frac{\lambda(e^t-1)}{n}\right)^n \longrightarrow \exp\{\lambda(e^t-1)\},$$
as $n \to \infty$, which is the mgf of a Poisson random variable. Therefore, the Poisson distribution arises as the limiting case of the binomial distribution, when $n \to \infty$, $\theta \to 0$ with $n\theta = \lambda$ constant (that is, for large $n$ and small $\theta$). So, if $n$ is large and $\theta$ is small, we can reasonably approximate $Bin(n,\theta)$ by $Poisson(\lambda)$.

2. Suppose that $X_1$ and $X_2$ are independent, with $X_1 \sim Poisson(\lambda_1)$, $X_2 \sim Poisson(\lambda_2)$. Then if $Y = X_1 + X_2$, using the general mgf result for independent random variables,
$$M_Y(t) = M_{X_1}(t)M_{X_2}(t) = \exp\{\lambda_1(e^t-1)\}\exp\{\lambda_2(e^t-1)\} = \exp\{(\lambda_1+\lambda_2)(e^t-1)\},$$
so that $Y \sim Poisson(\lambda_1+\lambda_2)$. Therefore, the sum of two independent Poisson random variables also has a Poisson distribution. This result can be extended easily: if $X_1,\dots,X_k$ are independent random variables with $X_i \sim Poisson(\lambda_i)$ for $i = 1,\dots,k$, then
$$Y = \sum_{i=1}^k X_i \implies Y \sim Poisson\Big(\sum_{i=1}^k \lambda_i\Big).$$

3. THE POISSON PROCESS (This material is not examinable.) The Poisson distribution arises as part of a larger modelling framework. Consider an experiment involving events (such as radioactive emissions) that occur repeatedly and randomly in time. Let $X(t)$ be the random variable representing the number of events that occur in the interval $[0,t]$, so that $X(t)$ takes values $0,1,2,\dots$. Informally, in a Poisson process we have the following properties: in the interval $(t, t+h)$ there may or may not be events; if $h$ is small, the probability of an event in $(t, t+h)$ is roughly proportional to $h$; and it is not very likely that two or more events occur in a small interval. Formally, a Poisson process with intensity $\lambda$ is a process $X(t)$, $t \ge 0$, taking values in $\{0,1,2,\dots\}$ such that

(a) $X(0) = 0$, and if $s < t$ then $X(s) \le X(t)$;

(b) $$P(X(t+h) = n+m \mid X(t) = n) = \begin{cases} \lambda h + o(h), & m = 1, \\ o(h), & m > 1, \\ 1 - \lambda h + o(h), & m = 0; \end{cases}$$

(c) if $s < t$, the number of events $X(t) - X(s)$ in $(s,t]$ is independent of the number of events in $[0,s]$.

Here $o(h)$ denotes any function with the property
$$\lim_{h\to 0}\frac{o(h)}{h} = 0.$$
Then it can be shown that
$$P_n(t) = P[X(t) = n] = P[n \text{ events occur in } [0,t]] = \frac{e^{-\lambda t}(\lambda t)^n}{n!},$$
that is, the random variable corresponding to the number of events that occur in the interval $[0,t]$ has a Poisson distribution with parameter $\lambda t$.
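Note 1 above is easy to see numerically; a small sketch (my own, not from the notes):

```python
from math import comb, exp, factorial

# For large n and small theta with n*theta = lam fixed, Bin(n, theta)
# probabilities approach Poisson(lam) probabilities.
lam, n = 3.0, 1000
theta = lam / n

for x in range(8):
    binom_p = comb(n, x) * theta**x * (1 - theta)**(n - x)
    pois_p = exp(-lam) * lam**x / factorial(x)
    print(x, round(binom_p, 5), round(pois_p, 5))   # columns nearly agree
```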

Definition 3.1.5 GEOMETRIC DISTRIBUTION $X \sim Geometric(\theta)$
$$f_X(x) = (1-\theta)^{x-1}\theta, \qquad x \in \mathbb{X} = \{1,2,\dots\}, \quad 0 \le \theta \le 1.$$

NOTES

1. The cdf is available analytically as
$$F_X(x) = 1 - (1-\theta)^x, \qquad x = 1,2,3,\dots$$

2. If $X \sim Geometric(\theta)$, then for $x, j \ge 1$,
$$P[X = x+j \mid X > j] = \frac{P[X = x+j, X > j]}{P[X > j]} = \frac{P[X = x+j]}{P[X > j]} = \frac{(1-\theta)^{x+j-1}\theta}{(1-\theta)^j} = (1-\theta)^{x-1}\theta = P[X = x].$$
So $P[X = x+j \mid X > j] = P[X = x]$. This property is unique (among discrete distributions) to the geometric distribution, and is called the lack of memory property.

3. Alternative representations are sometimes useful:
$$f_X(x) = \phi^{x-1}(1-\phi), \quad x = 1,2,3,\dots \ (\text{that is, } \phi = 1-\theta), \qquad f_X(x) = \phi^x(1-\phi), \quad x = 0,1,2,\dots.$$

4. The geometric distribution is used to model the number, $X$, of independent, identical Bernoulli trials until the first success is obtained. It is a discrete waiting time distribution.

Definition 3.1.6 NEGATIVE BINOMIAL DISTRIBUTION $X \sim NegBin(n,\theta)$
$$f_X(x) = \binom{x-1}{n-1}\theta^n(1-\theta)^{x-n}, \qquad x \in \mathbb{X} = \{n, n+1,\dots\}, \quad n \ge 1, \quad 0 \le \theta \le 1.$$

NOTES

1. If $X \sim Bin(n,\theta)$ and $Y \sim NegBin(r,\theta)$, then for $r \le n$, $P[X \ge r] = P[Y \le n]$.

2. The Negative Binomial distribution is used to model the number, $X$, of independent, identical Bernoulli trials needed to obtain exactly $n$ successes [the number of trials up to and including the $n$th success].

3. Alternative representation: let $Y$ be the number of failures in a sequence of independent, identical Bernoulli trials that contains exactly $n$ successes. Then $Y = X - n$, and hence
$$f_Y(y) = \binom{n+y-1}{n-1}\theta^n(1-\theta)^y, \qquad y \in \{0,1,\dots\}.$$

4. If $X_i \sim Geometric(\theta)$, for $i = 1,\dots,n$, are i.i.d. random variables, and $Y = X_1 + \cdots + X_n$, then $Y \sim NegBin(n,\theta)$ (the result follows immediately using mgfs; see the simulation sketch below).

5. If $X \sim NegBin(n,\theta)$, let $n(1-\theta)/\theta = \lambda$ and $Y = X - n$. Then
$$M_Y(t) = e^{-nt}M_X(t) = \left\{\frac{\theta}{1-(1-\theta)e^t}\right\}^n = \left\{1 - \frac{\lambda}{n}(e^t-1)\right\}^{-n} \longrightarrow \exp\{\lambda(e^t-1)\},$$
as $n \to \infty$; hence this alternative form of the negative binomial distribution tends to the Poisson distribution as $n \to \infty$ with $n(1-\theta)/\theta = \lambda$ constant.

Definition 3.1.7 HYPERGEOMETRIC DISTRIBUTION $X \sim HypGeom(N,R,n)$, for $N \ge R$ and $N \ge n$:
$$f_X(x) = \frac{\displaystyle\binom{R}{x}\binom{N-R}{n-x}}{\displaystyle\binom{N}{n}}, \qquad x \in \mathbb{X} = \{\max(0, n-N+R),\dots,\min(n,R)\},$$
and zero otherwise.

NOTES

1. The hypergeometric distribution is used as a model for experiments involving sampling without replacement from a finite population. Specifically, consider a finite population of size $N$, consisting of $R$ items of Type I and $N-R$ of Type II: take a sample of size $n$ without replacement, and let $X$ be the number of Type I objects in the sample. The mass function for the hypergeometric distribution can be obtained by using combinatorics/counting techniques. However, the form of the mass function does not lend itself readily to calculation of moments etc.

2. As $N, R \to \infty$ with $R/N = \theta$ (constant),
$$P[X = x] \longrightarrow \binom{n}{x}\theta^x(1-\theta)^{n-x},$$
so the distribution tends to a Binomial distribution.
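Note 4 of the Negative Binomial distribution above can be illustrated by simulation; a sketch of my own (note that numpy's parameterizations differ from the ones used in the notes, as flagged in the comments):

```python
import numpy as np

# The number of Bernoulli(theta) trials needed for n successes (NegBin in the
# "trials" form used above) equals a sum of n iid Geometric(theta) waiting times.
rng = np.random.default_rng(4)
theta, n, reps = 0.3, 5, 200_000

# numpy's negative_binomial counts FAILURES before the nth success,
# so add n to convert to the "number of trials" form of Definition 3.1.6.
negbin_trials = rng.negative_binomial(n, theta, size=reps) + n

# numpy's geometric already uses the "trials to first success" form.
geom_sums = rng.geometric(theta, size=(reps, n)).sum(axis=1)

print(negbin_trials.mean(), geom_sums.mean(), n / theta)  # all ~ n/theta
```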

CHAPTER 4 CONTINUOUS PROBABILITY DISTRIBUTIONS

Definition 4.1.1 CONTINUOUS UNIFORM DISTRIBUTION $X \sim Uniform(a,b)$
$$f_X(x) = \frac{1}{b-a}, \qquad a \le x \le b.$$

NOTES

1. The cdf is
$$F_X(x) = \frac{x-a}{b-a}, \qquad a \le x \le b.$$

2. The case $a = 0$ and $b = 1$ gives the Standard Uniform.

Definition 4.1.2 EXPONENTIAL DISTRIBUTION $X \sim Exp(\lambda)$
$$f_X(x) = \lambda e^{-\lambda x}, \qquad x > 0, \quad \lambda > 0.$$

NOTES

1. The cdf is
$$F_X(x) = 1 - e^{-\lambda x}, \qquad x > 0.$$

2. An alternative representation uses $\theta = 1/\lambda$ as the parameter of the distribution. This is sometimes used because the expectation and variance of the Exponential distribution are
$$E_{f_X}[X] = \frac{1}{\lambda} = \theta, \qquad \mathrm{Var}_{f_X}[X] = \frac{1}{\lambda^2}.$$

3. If $X \sim Exp(\lambda)$, then, for all $x, t > 0$,
$$P[X > x+t \mid X > t] = \frac{P[X > x+t, X > t]}{P[X > t]} = \frac{P[X > x+t]}{P[X > t]} = \frac{e^{-\lambda(x+t)}}{e^{-\lambda t}} = e^{-\lambda x} = P[X > x].$$
Thus, for all $x, t > 0$, $P[X > x+t \mid X > t] = P[X > x]$ - this is known as the Lack of Memory Property, and is unique to the exponential distribution amongst continuous distributions.

4. Suppose that $X(t)$ is a Poisson process with rate parameter $\lambda > 0$, so that
$$P[X(t) = n] = \frac{e^{-\lambda t}(\lambda t)^n}{n!}.$$
Let $X_1,\dots,X_n$ be random variables defined by $X_1 = $ the time at which the first event occurs, and, for $i = 2,\dots,n$, $X_i = $ the time interval between the occurrence of the $(i-1)$st and $i$th events. Then $X_1,\dots,X_n$

are IID because of the assumptions underlying the Poisson process. So consider the distribution of $X_1$; in particular, consider the probability $P[X_1 > x]$ for $x > 0$. The event $[X_1 > x]$ is equivalent to the event that no events occur in the interval $(0,x]$, which has probability $e^{-\lambda x}$. But then
$$F_{X_1}(x) = P[X_1 \le x] = 1 - P[X_1 > x] = 1 - e^{-\lambda x} \implies X_1 \sim Exp(\lambda).$$

5. The exponential distribution is used to model failure times in continuous time. It is a continuous waiting time distribution, the continuous analogue of the geometric distribution.

6. If $X \sim Uniform(0,1)$, and
$$Y = -\frac{1}{\lambda}\log(1-X),$$
then $Y \sim Exp(\lambda)$.

7. If $X \sim Exp(\lambda)$, and $Y = X^{1/\alpha}$, for $\alpha > 0$, then $Y$ has a (two-parameter) Weibull distribution, and
$$f_Y(y) = \alpha\lambda y^{\alpha-1}e^{-\lambda y^\alpha}, \qquad y > 0.$$

Definition 4.1.3 GAMMA DISTRIBUTION $X \sim Ga(\alpha,\beta)$
$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, \qquad x > 0, \quad \alpha,\beta > 0,$$
where, for any real number $\alpha > 0$, the gamma function, $\Gamma(\cdot)$, is defined by
$$\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}\,dt.$$

NOTES

1. If $X_1 \sim Ga(\alpha_1,\beta)$ and $X_2 \sim Ga(\alpha_2,\beta)$ are independent random variables, and $Y = X_1 + X_2$, then $Y \sim Ga(\alpha_1+\alpha_2,\beta)$ (directly from properties of mgfs).

2. $Ga(1,\beta) \equiv Exp(\beta)$.

3. If $X_1,\dots,X_n \sim Exp(\lambda)$ are independent random variables, and $Y = X_1 + \cdots + X_n$, then $Y \sim Ga(n,\lambda)$ (directly from 1. and 2.; see the sketch below).

4. For $\alpha > 1$, integrating by parts, we have that $\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)$, and hence if $\alpha = 1,2,\dots$, then $\Gamma(\alpha) = (\alpha-1)!$. A useful fact is that $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

5. Special Case: If $\alpha = 1,2,\dots$, the $Ga(\alpha/2, 1/2)$ distribution is also known as the chi-squared distribution with $\alpha$ degrees of freedom, denoted by $\chi^2_\alpha$.

6. If $X_1 \sim \chi^2_{n_1}$ and $X_2 \sim \chi^2_{n_2}$ are independent chi-squared random variables with $n_1$ and $n_2$ degrees of freedom respectively, then the random variable $F$ defined as the ratio
$$F = \frac{X_1/n_1}{X_2/n_2}$$
has an F-distribution with $(n_1, n_2)$ degrees of freedom.
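A quick simulation sketch (my own illustration) of Note 3, the sum-of-exponentials construction of the gamma distribution:

```python
import numpy as np

# A sum of n iid Exp(lam) variables has a Ga(n, lam) distribution, with
# mean n/lam and variance n/lam^2.
rng = np.random.default_rng(10)
lam, n, reps = 2.0, 5, 500_000

y = rng.exponential(scale=1 / lam, size=(reps, n)).sum(axis=1)
print(y.mean(), n / lam)        # gamma mean alpha/beta = n/lam
print(y.var(), n / lam**2)      # gamma variance alpha/beta^2 = n/lam^2

# Cross-check against numpy's own gamma sampler (shape n, scale 1/lam).
g = rng.gamma(shape=n, scale=1 / lam, size=reps)
print(g.mean(), g.var())
```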

Definition 4.1.4 BETA DISTRIBUTION $X \sim Be(\alpha,\beta)$
$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}, \qquad 0 < x < 1, \quad \alpha,\beta > 0.$$

NOTES

1. If $\alpha = \beta = 1$, $Be(\alpha,\beta) \equiv Uniform(0,1)$.

2. If $X_1 \sim Ga(\alpha_1,\beta)$ and $X_2 \sim Ga(\alpha_2,\beta)$ are independent random variables, and
$$Y = \frac{X_1}{X_1+X_2},$$
then $Y \sim Be(\alpha_1,\alpha_2)$ (using standard multivariate transformation techniques).

3. Suppose that random variables $X$ and $Y$ have a joint probability distribution such that the conditional distribution of $X$, given $Y = y$ for $0 < y < 1$, is binomial, $Bin(n,y)$, and the marginal distribution of $Y$ is beta, $Be(\alpha,\beta)$, so that
$$f_{X|Y}(x|y) = \binom{n}{x}y^x(1-y)^{n-x}, \quad x = 0,1,\dots,n, \qquad f_Y(y) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}y^{\alpha-1}(1-y)^{\beta-1}, \quad 0 < y < 1.$$
Then the marginal distribution of $X$ is given by
$$f_X(x) = \int_0^1 f_{X|Y}(x|y)f_Y(y)\,dy = \binom{n}{x}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(x+\alpha)\Gamma(n-x+\beta)}{\Gamma(n+\alpha+\beta)}, \qquad x = 0,1,2,\dots,n.$$
Note that this provides an example of a joint distribution of continuous $Y$ and discrete $X$.

Definition 4.1.5 NORMAL DISTRIBUTION $X \sim N(\mu,\sigma^2)$
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}, \qquad x \in \mathbb{R}, \quad \mu \in \mathbb{R}, \quad \sigma > 0.$$

NOTES

1. Special Case: If $\mu = 0$, $\sigma^2 = 1$, then $X$ has a standard or unit normal distribution. Usually, the pdf of the standard normal is written $\phi(x)$, and the cdf is written $\Phi(x)$.

2. If $X \sim N(0,1)$, and $Y = \sigma X + \mu$, then $Y \sim N(\mu,\sigma^2)$. Re-expressing this result, if $X \sim N(\mu,\sigma^2)$, and $Y = (X-\mu)/\sigma$, then $Y \sim N(0,1)$ (using transformation or mgf techniques).

3. The Central Limit Theorem. Suppose $X_1,\dots,X_n$ are IID random variables with some mgf $M_X$, with $E_{f_X}[X_i] = \mu$ and $\mathrm{Var}_{f_X}[X_i] = \sigma^2$; that is, the mgf and the expectation and variance of the $X_i$s are specified, but the pdf is not. Let the standardized random variable $Z_n$ be defined by
$$Z_n = \frac{\displaystyle\sum_{i=1}^n X_i - n\mu}{\sqrt{n\sigma^2}},$$
and let $Z_n$ have mgf $M_{Z_n}$. Then, as $n \to \infty$, $M_{Z_n}(t) \to \exp\{t^2/2\}$, irrespective of the distribution of the $X_i$s; that is, the distribution of $Z_n$ tends to a standard normal distribution as $n$ tends to infinity. This theorem will be proved and explained in Chapter 6.

4. If $X \sim N(0,1)$, and $Y = X^2$, then $Y \sim \chi^2_1$, so that the square of a unit normal random variable has a chi-squared distribution with 1 degree of freedom.

5. If $X \sim N(0,1)$ and $Y \sim N(0,1)$ are independent random variables, and $Z$ is defined by $Z = X/Y$, then $Z$ has a Cauchy distribution,
$$f_Z(z) = \frac{1}{\pi}\cdot\frac{1}{1+z^2}, \qquad z \in \mathbb{R}.$$

6. If $X \sim N(0,1)$, and $Y \sim Ga(n/2,1/2)$ for $n = 1,2,\dots$ (so that $Y \sim \chi^2_n$), are independent random variables, and $T$ is defined by
$$T = \frac{X}{\sqrt{Y/n}},$$
then $T$ has a Student-t distribution with $n$ degrees of freedom, $T \sim St(n)$:
$$f_T(t) = \frac{\Gamma\left(\dfrac{n+1}{2}\right)}{\Gamma\left(\dfrac{n}{2}\right)}\left(\frac{1}{n\pi}\right)^{1/2}\left\{1+\frac{t^2}{n}\right\}^{-(n+1)/2}, \qquad t \in \mathbb{R}.$$

Taking limiting cases of the Student-t distribution: as $n \to \infty$, $St(n) \to N(0,1)$; for $n = 1$, $St(1) \equiv$ Cauchy.

7. If $X_1 \sim N(\mu_1,\sigma_1^2)$ and $X_2 \sim N(\mu_2,\sigma_2^2)$ are independent and $a, b$ are constants, then
$$T = aX_1 + bX_2 \sim N(a\mu_1 + b\mu_2,\ a^2\sigma_1^2 + b^2\sigma_2^2).$$


CHAPTER 5 MULTIVARIATE PROBABILITY DISTRIBUTIONS

For purely notational reasons, it is convenient in this chapter to consider a random vector $\mathbf{X}$ as a column vector, $\mathbf{X} = (X_1,\dots,X_k)^T$, say.

5.1 THE MULTINOMIAL DISTRIBUTION

The multinomial distribution is a multivariate generalization of the binomial distribution. Recall that the binomial distribution arose from an infinite Urn model with two types of objects being sampled with replacement. Suppose that the proportion of Type 1 objects in the urn is $\theta$ (so $0 \le \theta \le 1$) and hence the proportion of Type 2 objects in the urn is $1-\theta$. Suppose that $n$ objects are sampled, and $X$ is the random variable corresponding to the number of Type 1 objects in the sample. Then $X \sim Bin(n,\theta)$, and
$$f_X(x) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad x \in \{0,1,2,\dots,n\}.$$
Now consider a generalization: suppose that the Urn contains $k+1$ types of objects ($k = 1,2,\dots$), with $\theta_i$ being the proportion of Type $i$ objects, for $i = 1,\dots,k+1$. Let $X_i$ be the random variable corresponding to the number of Type $i$ objects in a sample of size $n$, for $i = 1,\dots,k$. Then the joint distribution of the vector $\mathbf{X} = (X_1,\dots,X_k)^T$ is given by
$$f_{X_1,\dots,X_k}(x_1,\dots,x_k) = \frac{n!}{x_1!\cdots x_k!\,x_{k+1}!}\theta_1^{x_1}\cdots\theta_k^{x_k}\theta_{k+1}^{x_{k+1}} = \frac{n!}{x_1!\cdots x_k!\,x_{k+1}!}\prod_{i=1}^{k+1}\theta_i^{x_i},$$
where $0 \le \theta_i \le 1$ for all $i$, and $\theta_1+\cdots+\theta_k+\theta_{k+1} = 1$, and where $x_{k+1}$ is defined by $x_{k+1} = n - (x_1+\cdots+x_k)$. This is the mass function for the multinomial distribution, which reduces to the binomial if $k = 1$. It can also be shown that the marginal distribution of $X_i$ is $Bin(n,\theta_i)$.

5.2 THE DIRICHLET DISTRIBUTION

The Dirichlet distribution is a multivariate generalization of the beta distribution. Recall that the beta distribution arose as follows: suppose that $V_1$ and $V_2$ are independent Gamma random variables with $V_1 \sim Ga(\alpha_1,\beta)$, $V_2 \sim Ga(\alpha_2,\beta)$. Then if $X$ is defined by $X = V_1/(V_1+V_2)$, we have that $X \sim Be(\alpha_1,\alpha_2)$. Now consider a generalization: suppose that $V_1,\dots,V_{k+1}$ are independent Gamma random variables with $V_i \sim Ga(\alpha_i,\beta)$, for $i = 1,\dots,k+1$. Define
$$X_i = \frac{V_i}{V_1+\cdots+V_{k+1}}$$
for $i = 1,\dots,k$. Then the joint distribution of the vector $\mathbf{X} = (X_1,\dots,X_k)^T$ is given by the density
$$f_{X_1,\dots,X_k}(x_1,\dots,x_k) = \frac{\Gamma(\alpha)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)\Gamma(\alpha_{k+1})}x_1^{\alpha_1-1}\cdots x_k^{\alpha_k-1}x_{k+1}^{\alpha_{k+1}-1},$$

for $0 \le x_i \le 1$ for all $i$ such that $x_1+\cdots+x_k+x_{k+1} = 1$, where $\alpha = \alpha_1+\cdots+\alpha_{k+1}$ and where $x_{k+1}$ is defined by $x_{k+1} = 1 - (x_1+\cdots+x_k)$. This is the density function which reduces to the beta distribution if $k = 1$. It can also be shown that the marginal distribution of $X_i$ is $Be(\alpha_i, \alpha - \alpha_i)$.

5.3 THE MULTIVARIATE NORMAL DISTRIBUTION

The random vector $\mathbf{X} = (X_1,\dots,X_k)^T$ has a multivariate normal distribution if the joint pdf is of the form
$$f_{\mathbf{X}}(x_1,\dots,x_k) = \left(\frac{1}{2\pi}\right)^{k/2}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}.$$
Here $\mathbf{x}$ is the (column) vector of length $k$ formed by $x_1,\dots,x_k$, $\boldsymbol{\mu}$ is a (column) vector of length $k$, $\Sigma$ is a $k \times k$ symmetric, positive definite matrix [$\Sigma = \Sigma^T$, $\mathbf{x}^T\Sigma\mathbf{x} > 0$ for all $\mathbf{x} \ne \mathbf{0}$], and $|\Sigma|$ denotes the determinant of $\Sigma$. We write $\mathbf{X} \sim N_k(\boldsymbol{\mu},\Sigma)$.

Properties

1. $E[\mathbf{X}] = \boldsymbol{\mu}$: $\boldsymbol{\mu}$ is the mean vector of $\mathbf{X}$. If $\boldsymbol{\mu} = (\mu_1,\dots,\mu_k)^T$, we have $E[X_i] = \mu_i$. Further, $\Sigma$ is the variance-covariance matrix of $\mathbf{X}$: $\Sigma = [\sigma_{ij}]$, where $\sigma_{ij} = \mathrm{cov}[X_i,X_j]$.

2. Since $\Sigma$ is symmetric and positive definite, there exists a matrix $\Sigma^{1/2}$ [the square root of $\Sigma$] such that: (i) $\Sigma^{1/2}$ is symmetric; (ii) $\Sigma = \Sigma^{1/2}\Sigma^{1/2}$; (iii) $\Sigma^{1/2}\Sigma^{-1/2} = \Sigma^{-1/2}\Sigma^{1/2} = I$, the $k \times k$ identity matrix, where $\Sigma^{-1/2} = (\Sigma^{1/2})^{-1}$. Then, if $\mathbf{Z} \sim N_k(\mathbf{0},I)$, so that $Z_1,\dots,Z_k$ are IID $N(0,1)$, and $\mathbf{X} = \boldsymbol{\mu} + \Sigma^{1/2}\mathbf{Z}$, then $\mathbf{X} \sim N_k(\boldsymbol{\mu},\Sigma)$. Conversely, if $\mathbf{X} \sim N_k(\boldsymbol{\mu},\Sigma)$, then $\Sigma^{-1/2}(\mathbf{X}-\boldsymbol{\mu}) \sim N_k(\mathbf{0},I)$.

3. A useful result is the following: if $\mathbf{X} \sim N_k(\boldsymbol{\mu},\Sigma)$ and $D$ is an $m \times k$ matrix of rank $m \le k$, then $\mathbf{Y} \equiv D\mathbf{X} \sim N_m(D\boldsymbol{\mu}, D\Sigma D^T)$. A special case is where $\mathbf{X} \sim N_k(\mathbf{0},I)$ and $D$ is a $k \times k$ matrix of full rank $k$, so that $D$ is invertible: then $\mathbf{Y} = D\mathbf{X} \sim N_k(\mathbf{0}, DD^T)$.

4. Suppose we partition $\mathbf{X}$ as
$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_a \\ \mathbf{X}_b \end{pmatrix}, \quad \text{with} \quad \mathbf{X}_a = \begin{pmatrix} X_1 \\ \vdots \\ X_m \end{pmatrix}, \quad \mathbf{X}_b = \begin{pmatrix} X_{m+1} \\ \vdots \\ X_k \end{pmatrix}.$$
We can similarly partition $\boldsymbol{\mu}$ and $\Sigma$:
$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}.$$
Then if $\mathbf{X} \sim N_k(\boldsymbol{\mu},\Sigma)$ we have:

(I) The marginal distribution of $\mathbf{X}_a$ is $N_m(\boldsymbol{\mu}_a, \Sigma_{aa})$.

(II) The conditional distribution of $\mathbf{X}_b$, given $\mathbf{X}_a = \mathbf{x}_a$, is
$$\mathbf{X}_b \mid \mathbf{X}_a = \mathbf{x}_a \sim N_{k-m}\big(\boldsymbol{\mu}_b + \Sigma_{ba}\Sigma_{aa}^{-1}(\mathbf{x}_a - \boldsymbol{\mu}_a),\ \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\big).$$

(III) If $\mathbf{a} = (a_1,\dots,a_k)^T$, then
$$\mathbf{a}^T\mathbf{X} \equiv \sum_{i=1}^k a_iX_i \sim N(\mathbf{a}^T\boldsymbol{\mu}, \mathbf{a}^T\Sigma\mathbf{a}).$$

(IV) $V = (\mathbf{X}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{X}-\boldsymbol{\mu}) \sim \chi^2_k$.
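Property (II) can be checked numerically; the sketch below (my own illustration, assuming particular values of mu and Sigma) takes k = 2 and m = 1:

```python
import numpy as np

# Compare the empirical conditional moments of X2 given X1 near x_a with
# mu_b + Sigma_ba Sigma_aa^{-1} (x_a - mu_a) and Sigma_bb - Sigma_ba Sigma_aa^{-1} Sigma_ab.
rng = np.random.default_rng(6)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
xa = 2.0

cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (xa - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]

near = X[np.abs(X[:, 0] - xa) < 0.05, 1]   # draws with X1 close to xa
print(near.mean(), cond_mean)
print(near.var(), cond_var)
```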


CHAPTER 6 PROBABILITY RESULTS & LIMIT THEOREMS

6.1 BOUNDS ON PROBABILITIES BASED ON MOMENTS

Theorem 6.1.1 If $X$ is a random variable, then for a non-negative function $h$, and $c > 0$,
$$P[h(X) \ge c] \le \frac{E_{f_X}[h(X)]}{c}.$$

Proof (continuous case). Suppose that $X$ has density function $f_X$ which is positive for $x \in \mathbb{X}$. Let $A = \{x \in \mathbb{X} : h(x) \ge c\} \subseteq \mathbb{X}$. Then, as $h(x) \ge c$ on $A$,
$$E_{f_X}[h(X)] = \int h(x)f_X(x)\,dx = \int_A h(x)f_X(x)\,dx + \int_{A'} h(x)f_X(x)\,dx \ge \int_A h(x)f_X(x)\,dx \ge \int_A cf_X(x)\,dx = cP[X \in A] = cP[h(X) \ge c].$$

SPECIAL CASE I - THE MARKOV INEQUALITY: If $h(x) = |x|^r$ for $r > 0$, then
$$P[|X|^r \ge c] \le \frac{1}{c}E_{f_X}[|X|^r].$$

SPECIAL CASE II - THE CHEBYCHEV INEQUALITY: Suppose that $X$ is a random variable with expectation $\mu$ and variance $\sigma^2$. Then taking $h(x) = (x-\mu)^2$ and $c = k^2\sigma^2$, for $k > 0$, gives $P[|X-\mu| \ge k\sigma] \le 1/k^2$, and setting $\epsilon = k\sigma$ gives
$$P[|X-\mu| \ge \epsilon] \le \frac{\sigma^2}{\epsilon^2}, \qquad P[|X-\mu| < \epsilon] \ge 1 - \frac{\sigma^2}{\epsilon^2}$$
(illustrated numerically in the sketch below).

Theorem 6.1.2 JENSEN'S INEQUALITY Suppose that $X$ is a random variable, and the function $g$ is convex, so that
$$\frac{d^2}{dt^2}\{g(t)\}\Big|_{t=x} = g''(x) > 0, \quad \forall x,$$
with Taylor expansion around the expectation $\mu$ of the form
$$g(x) = g(\mu) + (x-\mu)g'(\mu) + \frac{1}{2}(x-\mu)^2g''(x_0), \tag{6.1}$$
for some $x_0$ between $x$ and $\mu$. Then $E_{f_X}[g(X)] \ge g(E_{f_X}[X])$.

Proof. Taking expectations in (6.1), and noting that $E_{f_X}[(X-\mu)] = 0$ and $(X-\mu)^2g''(x_0) \ge 0$ (since $g'' > 0$), we have that
$$E_{f_X}[g(X)] = g(\mu) + 0\cdot g'(\mu) + \frac{1}{2}E_{f_X}\big[(X-\mu)^2g''(x_0)\big] \ge g(\mu) = g(E_{f_X}[X]).$$
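A quick numerical sketch (my own, not from the notes) of the Chebychev inequality, using an Exponential(1) sample for which $\mu = \sigma = 1$:

```python
import numpy as np

# P[|X - mu| >= k*sigma] <= 1/k^2; the actual probability can be far below
# the bound, which holds for ANY distribution with finite variance.
rng = np.random.default_rng(7)
x = rng.exponential(size=1_000_000)
mu, sigma = 1.0, 1.0

for k in (1.5, 2.0, 3.0):
    p = np.mean(np.abs(x - mu) >= k * sigma)
    print(k, round(p, 4), "<=", round(1 / k**2, 4))
```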

6.2 THE CENTRAL LIMIT THEOREM

Theorem 6.2.1 Suppose $X_1,\dots,X_n$ are i.i.d. random variables with mgf $M_X$, with $E_{f_X}[X_i] = \mu$ and $\mathrm{Var}_{f_X}[X_i] = \sigma^2$, both finite. Let the random variable $Z_n$ be defined by
$$Z_n = \frac{\displaystyle\sum_{i=1}^n X_i - n\mu}{\sqrt{n\sigma^2}},$$
and let $Z_n$ have mgf $M_{Z_n}$. Then, as $n \to \infty$,
$$M_{Z_n}(t) \to \exp\left\{\frac{t^2}{2}\right\},$$
irrespective of the form of $M_X$.

Proof. First, let $Y_i = (X_i-\mu)/\sigma$ for $i = 1,\dots,n$. Then $Y_1,\dots,Y_n$ are i.i.d. with mgf $M_Y$, say, by the elementary properties of expectation, and $E_{f_Y}[Y_i] = 0$, $\mathrm{Var}_{f_Y}[Y_i] = 1$, for each $i$. Using the power series expansion result for mgfs, we have that
$$M_Y(t) = 1 + tE_{f_Y}[Y] + \frac{t^2}{2!}E_{f_Y}[Y^2] + \frac{t^3}{3!}E_{f_Y}[Y^3] + \frac{t^4}{4!}E_{f_Y}[Y^4] + \cdots = 1 + \frac{t^2}{2!} + \frac{t^3}{3!}E_{f_Y}[Y^3] + \frac{t^4}{4!}E_{f_Y}[Y^4] + \cdots.$$
Now, the random variable $Z_n$ can be rewritten
$$Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n\left(\frac{X_i-\mu}{\sigma}\right),$$
and thus, again by a standard mgf result, as $Y_1,\dots,Y_n$ are independent, we have that
$$M_{Z_n}(t) = \left\{M_Y(t/\sqrt{n})\right\}^n = \left\{1 + \frac{t^2}{2n} + \frac{t^3}{3!\,n^{3/2}}E_{f_Y}[Y^3] + \frac{t^4}{4!\,n^2}E_{f_Y}[Y^4] + \cdots\right\}^n.$$
Thus, as $n \to \infty$, using the property of the exponential function, which gives that if $a_n \to a$, then
$$\left(1 + \frac{a_n}{n}\right)^n \to e^a,$$
we have
$$M_{Z_n}(t) \to \exp\left\{\frac{t^2}{2}\right\}.$$

INTERPRETATION: Sums of independent and identically distributed random variables have a limiting distribution that is Normal, irrespective of the distribution of the variables.
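The theorem is easily illustrated by simulation; a sketch of my own, using Exponential(1) summands (for which $\mu = \sigma = 1$):

```python
import numpy as np

# Standardized sums of iid Exponential(1) variables look standard normal
# for large n, even though each summand is heavily skewed.
rng = np.random.default_rng(8)
n, reps = 200, 100_000

x = rng.exponential(size=(reps, n))
zn = (x.sum(axis=1) - n * 1.0) / np.sqrt(n * 1.0)

print(zn.mean(), zn.std())           # approximately 0 and 1
print(np.mean(zn <= 1.96), 0.975)    # tail probability matches Phi(1.96)
```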

6.3 MODES OF STOCHASTIC CONVERGENCE

6.3.1 CONVERGENCE IN DISTRIBUTION

Definition 6.3.1 Consider a sequence $\{X_n\}$, $n = 1,2,\dots$, of random variables and a corresponding sequence of cdfs, $F_{X_1}, F_{X_2},\dots$, so that for $n = 1,2,\dots$, $F_{X_n}(x) = P[X_n \le x]$. Suppose that there exists a cdf, $F_X$, such that for all $x$ at which $F_X$ is continuous,
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x).$$
Then the sequence $\{X_n\}$ converges in distribution to the random variable $X$ with cdf $F_X$. This is denoted $X_n \xrightarrow{d} X$, and $F_X$ is the limiting distribution.

Convergence of a sequence of mgfs also indicates convergence in distribution. That is, if for all $t$ at which $M_X(t)$ is defined we have, as $n \to \infty$,
$$M_{X_n}(t) \to M_X(t),$$
then $X_n \xrightarrow{d} X$.

Definition 6.3.2 The sequence $\{X_n\}$ of random variables converges in distribution to the constant $c$ if the limiting distribution of $X_n$ is degenerate at $c$; that is, $X_n \xrightarrow{d} X$ and $P[X = c] = 1$, so that
$$F_X(x) = \begin{cases} 0, & x < c, \\ 1, & x \ge c. \end{cases}$$
This special type of convergence in distribution occurs when the limiting distribution is discrete, with the probability mass function only being non-zero at a single value. That is, if the limiting random variable is $X$, then $f_X(x) = 1$ for $x = c$, and zero otherwise.

Theorem 6.3.1 The sequence of random variables $\{X_n\}$ converges in distribution to $c$ if and only if, for all $\epsilon > 0$,
$$\lim_{n\to\infty} P[|X_n - c| < \epsilon] = 1.$$
This theorem indicates that convergence in distribution to a constant $c$ occurs if and only if the probability becomes increasingly concentrated around $c$ as $n \to \infty$.

6.3.2 CONVERGENCE IN PROBABILITY

Definition 6.3.3 CONVERGENCE IN PROBABILITY TO A CONSTANT The sequence of random variables $\{X_n\}$ converges in probability to the constant $c$, denoted $X_n \xrightarrow{P} c$, if for all $\epsilon > 0$
$$\lim_{n\to\infty} P[|X_n - c| < \epsilon] = 1, \quad \text{or, equivalently,} \quad \lim_{n\to\infty} P[|X_n - c| \ge \epsilon] = 0,$$
that is, if the limiting distribution of $X_n$ is degenerate at $c$.

Interpretation: convergence in probability to a constant is precisely equivalent to convergence in distribution to a constant.

A very useful result is Slutsky's Theorem, which states that if $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$, where $c$ is a finite constant, then: (i) $X_n + Y_n \xrightarrow{d} X + c$; (ii) $X_nY_n \xrightarrow{d} cX$; (iii) $X_n/Y_n \xrightarrow{d} X/c$, if $c \ne 0$.

Theorem 6.3.2 WEAK LAW OF LARGE NUMBERS Suppose that $\{X_n\}$ is a sequence of i.i.d. random variables with expectation $\mu$ and variance $\sigma^2$. Let $Y_n$ be defined by
$$Y_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
Then, for all $\epsilon > 0$,
$$\lim_{n\to\infty} P[|Y_n - \mu| < \epsilon] = 1,$$
that is, $Y_n \xrightarrow{P} \mu$, and thus the mean of $X_1,\dots,X_n$ converges in probability to $\mu$.

Proof. Using the properties of expectation, it can be shown that $Y_n$ has expectation $\mu$ and variance $\sigma^2/n$, and hence by the Chebychev Inequality,
$$P[|Y_n - \mu| \ge \epsilon] \le \frac{\sigma^2}{n\epsilon^2} \to 0, \quad \text{as } n \to \infty,$$
for all $\epsilon > 0$. Hence $P[|Y_n - \mu| < \epsilon] \to 1$ as $n \to \infty$, and $Y_n \xrightarrow{P} \mu$.

Definition 6.3.4 CONVERGENCE TO A RANDOM VARIABLE The sequence of random variables $\{X_n\}$ converges in probability to the random variable $X$, denoted $X_n \xrightarrow{P} X$, if, for all $\epsilon > 0$,
$$\lim_{n\to\infty} P[|X_n - X| < \epsilon] = 1, \quad \text{or, equivalently,} \quad \lim_{n\to\infty} P[|X_n - X| \ge \epsilon] = 0.$$

6.3.3 CONVERGENCE IN QUADRATIC MEAN

Definition 6.3.5 CONVERGENCE IN QUADRATIC MEAN The sequence of random variables $\{X_n\}$ converges in quadratic mean (also called $L_2$ convergence) to the random variable $X$, denoted $X_n \xrightarrow{qm} X$, if
$$E(X_n - X)^2 \to 0, \quad \text{as } n \to \infty.$$

Theorem 6.3.3 For the sequence $\{X_n\}$ of random variables,

(a) convergence in quadratic mean to a random variable implies convergence in probability: $X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X$;

(b) convergence in probability to a random variable implies convergence in distribution: $X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X$.

Proof. The proof of (a) is simple. Suppose that $X_n \xrightarrow{qm} X$. Fix $\epsilon > 0$. Then, by Markov's inequality,
$$P(|X_n - X| > \epsilon) = P((X_n - X)^2 > \epsilon^2) \le \frac{E(X_n - X)^2}{\epsilon^2} \to 0.$$

Proof of (b). Fix $\epsilon > 0$ and let $x$ be a continuity point of $F_X$. Then
$$F_{X_n}(x) = P(X_n \le x, X \le x+\epsilon) + P(X_n \le x, X > x+\epsilon) \le P(X \le x+\epsilon) + P(|X_n - X| > \epsilon) = F_X(x+\epsilon) + P(|X_n - X| > \epsilon).$$
Also,
$$F_X(x-\epsilon) = P(X \le x-\epsilon) = P(X \le x-\epsilon, X_n \le x) + P(X \le x-\epsilon, X_n > x) \le F_{X_n}(x) + P(|X_n - X| > \epsilon).$$
Hence
$$F_X(x-\epsilon) - P(|X_n - X| > \epsilon) \le F_{X_n}(x) \le F_X(x+\epsilon) + P(|X_n - X| > \epsilon).$$
Take the limit as $n \to \infty$ to conclude that
$$F_X(x-\epsilon) \le \liminf_{n\to\infty} F_{X_n}(x) \le \limsup_{n\to\infty} F_{X_n}(x) \le F_X(x+\epsilon).$$
This holds for all $\epsilon > 0$. Take the limit as $\epsilon \to 0$ and use the fact that $F_X$ is continuous at $x$ to conclude that
$$\lim_{n\to\infty} F_{X_n}(x) = F_X(x).$$

Note that the reverse implications do not hold: convergence in probability does not imply convergence in quadratic mean, and convergence in distribution does not imply convergence in probability.


CHAPTER 7 STATISTICAL ANALYSIS

7.1 STATISTICAL SUMMARIES

Definition 7.1.1 A collection of independent, identically distributed random variables $X_1,\dots,X_n$, each of which has distribution defined by cdf $F_X$ (or mass/density function $f_X$), is a random sample of size $n$ from $F_X$ (or $f_X$).

Definition 7.1.2 A function, $T$, of a random sample, $X_1,\dots,X_n$, that is, $T = t(X_1,\dots,X_n)$, that depends only on $X_1,\dots,X_n$, is a statistic. A statistic is a random variable. For example, the sample mean
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
is a statistic. But a statistic $T$ need not necessarily be constructed from a random sample (the random variables need not be independent, identically distributed), though that is the case encountered most often. In many circumstances it is necessary to consider statistics constructed from a collection of independent, but not identically distributed, random variables.

7.2 SAMPLING DISTRIBUTIONS

Definition 7.2.1 If $X_1,\dots,X_n$ is a random sample from $F_X$, say, and $T = t(X_1,\dots,X_n)$ is a statistic, then $F_T$ (or $f_T$), the cdf (or mass/density function) of the random variable $T$, is the sampling distribution of $T$. This notion extends immediately to the case of a statistic $T$ constructed from a general collection of random variables $X_1,\dots,X_n$.

EXAMPLE: If $X_1,\dots,X_n$ are independent random variables, with $X_i \sim N(\mu_i,\sigma_i^2)$ for $i = 1,\dots,n$, and $a_1,\dots,a_n$ are constants, consider the distribution of the random variable $Y$ defined by $Y = \sum_{i=1}^n a_iX_i$. Using standard mgf results, the distribution of $Y$ is derived to be normal with parameters
$$\mu_Y = \sum_{i=1}^n a_i\mu_i, \qquad \sigma_Y^2 = \sum_{i=1}^n a_i^2\sigma_i^2.$$
Now consider the special case of this result when $X_1,\dots,X_n$ are independent, identically distributed with $\mu_i = \mu$ and $\sigma_i^2 = \sigma^2$, and where $a_i = 1/n$ for $i = 1,\dots,n$. Then
$$Y = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

Definition 7.2.2 For a random sample $X_1,\dots,X_n$ from a probability distribution, the sample variance, $S^2$, is the statistic defined by
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar{X})^2.$$

Theorem 7.2.1 SAMPLING DISTRIBUTION FOR NORMAL SAMPLES If $X_1,\dots,X_n$ is a random sample from a normal distribution, say $X_i \sim N(\mu,\sigma^2)$, then:

(a) $\bar{X}$ is independent of $\{X_i - \bar{X},\ i = 1,\dots,n\}$;

(b) $\bar{X}$ and $S^2$ are independent random variables;

(c) the random variable
$$\frac{(n-1)S^2}{\sigma^2} = \frac{\displaystyle\sum_{i=1}^n(X_i - \bar{X})^2}{\sigma^2}$$
has a chi-squared distribution with $n-1$ degrees of freedom.

Proof. Omitted here.

Theorem 7.2.2 Suppose that $X_1,\dots,X_n$ is a random sample from a normal distribution, say $X_i \sim N(\mu,\sigma^2)$. Then the random variable
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has a Student-t distribution with $n-1$ degrees of freedom.

Proof. Consider the random variables
$$Z = \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma} \sim N(0,1), \qquad V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}, \qquad T = \frac{Z}{\sqrt{\dfrac{V}{n-1}}},$$
and use the properties of the normal distribution and related random variables (NOTE 6, following Definition 4.1.5). Also, see EXERCISES 5, Q4 (b).
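Theorem 7.2.1(c) is easy to check by simulation; a sketch of my own, with assumed parameter values:

```python
import numpy as np

# For normal samples, (n-1)S^2/sigma^2 behaves like a chi-squared random
# variable with n-1 degrees of freedom (mean n-1, variance 2(n-1)).
rng = np.random.default_rng(9)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)          # sample variance S^2 in each replicate
q = (n - 1) * s2 / sigma**2

print(q.mean(), n - 1)              # chi^2_{n-1} mean = n-1
print(q.var(), 2 * (n - 1))         # chi^2_{n-1} variance = 2(n-1)
```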

7.3 HYPOTHESIS TESTING

7.3.1 TESTING FOR NORMAL SAMPLES - THE Z-TEST

We concentrate initially on random data samples that we can assume to have a normal distribution, and utilize the Theorem from the previous section. We will look at two situations, namely one sample and two sample experiments. So, we suppose that $X_1,\dots,X_n \sim N(\mu,\sigma^2)$ (one sample), and $X_1,\dots,X_n \sim N(\mu_X,\sigma_X^2)$, $Y_1,\dots,Y_n \sim N(\mu_Y,\sigma_Y^2)$ (two sample); in the latter case we assume also independence of the two samples.

ONE SAMPLE Possible tests of interest are: $\mu = \mu_0$, $\sigma = \sigma_0$, for some specified constants $\mu_0$ and $\sigma_0$.

TWO SAMPLE Possible tests of interest are: $\mu_X = \mu_Y$, $\sigma_X = \sigma_Y$.

Recall from Theorem 7.2.1 that, if $X_1,\dots,X_n \sim N(\mu,\sigma^2)$ are the i.i.d. outcome random variables of $n$ experimental trials, then
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{and} \quad \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1},$$
with $\bar{X}$ and $S^2$ statistically independent.

Suppose we want to test the hypothesis that $\mu = \mu_0$, for some specified constant $\mu_0$ (for example, $\mu_0 = 20.0$), is a plausible model. More specifically, we want to test $H_0: \mu = \mu_0$, the NULL hypothesis, against $H_1: \mu \ne \mu_0$, the ALTERNATIVE hypothesis. So, we want to test whether $H_0$ is true, or whether $H_1$ is true.

In the case of a Normal sample, the distribution of $\bar{X}$ is Normal, and
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \implies Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1),$$
where $Z$ is a random variable. Now, when we have observed the data sample, we can calculate $\bar{x}$, and therefore we have a way of testing whether $\mu = \mu_0$ is a plausible model: we calculate $\bar{x}$ from $x_1,\dots,x_n$, and then calculate
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}.$$
If $H_0$ is true, and $\mu = \mu_0$, then the observed $z$ should be an observation from an $N(0,1)$ distribution (as $Z \sim N(0,1)$), that is, it should be near zero with high probability. In fact, $z$ should lie between $-1.96$ and $1.96$ with probability $1-\alpha = 0.95$, say, as
$$P[-1.96 \le Z < 1.96] = \Phi(1.96) - \Phi(-1.96) = 0.975 - 0.025 = 0.95.$$
If we observe $z$ to be outside of this range, then there is evidence that $H_0$ is not true. So, basically, if we observe an extreme value of $z$, either $H_0$ is true but we have observed a rare event, or we prefer to disbelieve $H_0$ and conclude that the data contains evidence against $H_0$.

Notice the asymmetry between $H_0$ and $H_1$. The null hypothesis is conservative, reflecting perhaps a current state of belief, and we are testing whether the data is consistent with that hypothesis, only rejecting $H_0$ in favour of the alternative $H_1$ if the evidence is clear, i.e. when the data represent a rare event under $H_0$.

[Figure 7.1: CRITICAL REGION IN A Z-TEST (taken from Schaum's ELEMENTS OF STATISTICS II, Bernstein & Bernstein).]

As an alternative approach, we could calculate the probability $p$ of observing a $z$ value that is more extreme than the $z$ we did observe; this probability is given by
$$p = \begin{cases} 2\Phi(z), & z < 0, \\ 2(1-\Phi(z)), & z \ge 0. \end{cases}$$
If this $p$ is very small, say $p \le \alpha = 0.05$, then again there is evidence that $H_0$ is not true. This approach is called significance testing.

In summary, we need to assess whether $z$ is a surprising observation from an $N(0,1)$ distribution - if it is, then we reject $H_0$. Figure 7.1 depicts the critical region in a Z-test.

7.3.2 HYPOTHESIS TESTING TERMINOLOGY

There are five crucial components to a hypothesis test, namely:

- the TEST STATISTIC;
- the NULL DISTRIBUTION of the statistic;
- the SIGNIFICANCE LEVEL of the test, usually denoted by $\alpha$;
- the P-VALUE, denoted $p$;
- the CRITICAL VALUE(S) of the test.
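Before mapping these components to the Z-test, here is a compact sketch of the one-sample Z-test and its p-value calculation (my own illustration; the data values are made up):

```python
import math

# One-sample Z-test: data x1..xn assumed N(mu, sigma^2) with sigma known,
# testing H0: mu = mu0 against H1: mu != mu0.
def z_test(xs, mu0, sigma, alpha=0.05):
    n = len(xs)
    xbar = sum(xs) / n
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))  # standard normal cdf
    p = 2 * phi(z) if z < 0 else 2 * (1 - phi(z))           # two-sided p-value
    # For alpha = 0.05 the critical values are approximately +/- 1.96.
    return z, p, p <= alpha

xs = [19.1, 20.3, 21.2, 19.8, 20.9, 20.4]    # hypothetical sample
print(z_test(xs, mu0=20.0, sigma=1.0))
```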

In the Normal example given above, we have that:

- $z$ is the test statistic;
- the distribution of the random variable $Z$ if $H_0$ is true is the null distribution;
- $\alpha = 0.05$ is the significance level of the test (choosing $\alpha = 0.01$ gives a stronger test);
- $p$ is the p-value of the test statistic under the null distribution;
- the solution $C_R$ of $\Phi(C_R) = 1 - \alpha/2$ gives the critical values of the test, $\pm C_R$.

These critical values define the boundary of a critical region: if the value $z$ is in the critical region, we reject $H_0$.

7.3.3 THE t-TEST

In practice, we will often want to test hypotheses about $\mu$ when $\sigma$ is unknown. We cannot perform the Z-test, as this requires knowledge of $\sigma$ to calculate the $z$ statistic. Recall that we know the sampling distributions of $\bar{X}$ and $S^2$, and that the two estimators are statistically independent. Now, from the properties of the Normal distribution, if we have independent random variables $Z \sim N(0,1)$ and $Y \sim \chi^2_\nu$, then we know that the random variable $T$ defined by
$$T = \frac{Z}{\sqrt{Y/\nu}}$$
has a Student-t distribution with $\nu$ degrees of freedom. Using this result, and recalling the sampling distributions of $\bar{X}$ and $S^2$, we see that
$$T = \frac{\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}:$$
$T$ has a Student-t distribution with $n-1$ degrees of freedom, denoted $St(n-1)$, which does not depend on $\sigma^2$. Thus we can repeat the procedure used in the $\sigma$ known case, but use the sampling distribution of $T$ rather than that of $Z$ to assess whether the value of the test statistic is surprising or not. Specifically, we calculate the observed value
$$t = \frac{\bar{x}-\mu}{s/\sqrt{n}}$$
and find the critical values for an $\alpha = 0.05$ test by finding the ordinates corresponding to the 0.025 and 0.975 percentiles of a Student-t distribution, $St(n-1)$ (rather than an $N(0,1)$ distribution).

7.3.4 TEST FOR $\sigma$

The Z-test and t-test are both tests for the parameter $\mu$. To perform a test about $\sigma$, say
$$H_0: \sigma = \sigma_0, \qquad H_1: \sigma \ne \sigma_0,$$

we construct a test based on the estimate of variance, $S^2$. In particular, we saw from Theorem 7.2.1 that the random variable $Q$, defined by
$$Q = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1},$$
if the data have an $N(\mu,\sigma^2)$ distribution. Hence if we define the test statistic value $q$ by
$$q = \frac{(n-1)s^2}{\sigma_0^2},$$
then we can compare $q$ with the critical values derived from a $\chi^2_{n-1}$ distribution; we look for the 0.025 and 0.975 quantiles - note that the chi-squared distribution is not symmetric, so we need two distinct critical values.

7.3.5 TWO SAMPLE TESTS

It is straightforward to extend the ideas from the previous sections to two sample situations where we wish to compare the distributions underlying two data samples. Typically, we consider sample one, $x_1,\dots,x_{n_X}$, from a $N(\mu_X,\sigma_X^2)$ distribution, and sample two, $y_1,\dots,y_{n_Y}$, independently from a $N(\mu_Y,\sigma_Y^2)$ distribution, and test the equality of the parameters in the two models. Suppose that the sample mean and sample variance for samples one and two are denoted $(\bar{x}, s_X^2)$ and $(\bar{y}, s_Y^2)$ respectively.

1. First, consider the hypothesis testing problem defined by
$$H_0: \mu_X = \mu_Y, \qquad H_1: \mu_X \ne \mu_Y,$$
when $\sigma_X = \sigma_Y = \sigma$ is known, so the two samples come from normal distributions with the same, known, variance. Now, from the sampling distributions theorem we have, under $H_0$,
$$\bar{X} \sim N\left(\mu_X, \frac{\sigma^2}{n_X}\right), \quad \bar{Y} \sim N\left(\mu_Y, \frac{\sigma^2}{n_Y}\right) \implies \bar{X} - \bar{Y} \sim N\left(0, \frac{\sigma^2}{n_X} + \frac{\sigma^2}{n_Y}\right),$$
since $\bar{X}$ and $\bar{Y}$ are independent. Hence, by the properties of normal random variables,
$$Z = \frac{\bar{X} - \bar{Y}}{\sigma\sqrt{\dfrac{1}{n_X} + \dfrac{1}{n_Y}}} \sim N(0,1),$$
if $H_0$ is true, giving us a test statistic $z$ defined by
$$z = \frac{\bar{x} - \bar{y}}{\sigma\sqrt{\dfrac{1}{n_X} + \dfrac{1}{n_Y}}},$$
which we can compare with the standard normal distribution. If $z$ is a surprising observation from $N(0,1)$, and lies in the critical region, then we reject $H_0$. This procedure is the Two Sample Z-Test.

2. If we can assume that $\sigma_X = \sigma_Y$, but the common value, $\sigma$, say, is unknown, we parallel the one sample t-test by replacing $\sigma$ by an estimate in the two sample Z-test. First, we obtain an estimate of $\sigma$ by pooling the two samples; our estimate is the pooled estimate, $s_P^2$, defined by
$$s_P^2 = \frac{(n_X-1)s_X^2 + (n_Y-1)s_Y^2}{n_X + n_Y - 2},$$
which we then use to form the test statistic $t$ defined by
$$t = \frac{\bar{x} - \bar{y}}{s_P\sqrt{\dfrac{1}{n_X} + \dfrac{1}{n_Y}}}.$$
It can be shown that if $H_0$ is true then $t$ should be an observation from a Student-t distribution with $n_X + n_Y - 2$ degrees of freedom. Hence we can derive the critical values from the tables of the Student-t distribution.

3. If $\sigma_X \ne \sigma_Y$, but both parameters are known, we can use a similar approach to the one above to derive a test statistic $z$ defined by
$$z = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma_X^2}{n_X} + \dfrac{\sigma_Y^2}{n_Y}}},$$
which has an $N(0,1)$ distribution if $H_0$ is true.

4. If $\sigma_X \ne \sigma_Y$, but both parameters are unknown, we can use a similar approach to the one above to derive a test statistic $t$ defined by
$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}}.$$
The distribution of this statistic when $H_0$ is true is not analytically available, but can be adequately approximated by a Student $St(m)$ distribution, where
$$m = \frac{(w_X + w_Y)^2}{\dfrac{w_X^2}{n_X-1} + \dfrac{w_Y^2}{n_Y-1}}, \qquad w_X = \frac{s_X^2}{n_X}, \quad w_Y = \frac{s_Y^2}{n_Y},$$
as computed in the sketch below.
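A small sketch (my own, with made-up summary statistics) of the approximate degrees-of-freedom calculation in case 4:

```python
import math

# Two-sample t statistic with unequal, unknown variances, plus the
# approximate Student degrees of freedom m from the formula above.
def welch_t(xbar, ybar, s2x, s2y, nx, ny):
    wx, wy = s2x / nx, s2y / ny
    t = (xbar - ybar) / math.sqrt(wx + wy)
    m = (wx + wy) ** 2 / (wx**2 / (nx - 1) + wy**2 / (ny - 1))
    return t, m   # compare t against the quantiles of a Student(m) distribution

print(welch_t(xbar=10.2, ybar=9.1, s2x=4.0, s2y=1.5, nx=12, ny=8))
```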

Clearly, the choice of test we use depends on whether $\sigma_X = \sigma_Y$ or not. We may test this hypothesis formally; to test
$$H_0: \sigma_X = \sigma_Y, \qquad H_1: \sigma_X \ne \sigma_Y,$$
we compute the test statistic
$$Q = \frac{S_X^2}{S_Y^2},$$
which has as null distribution the F distribution with $(n_X-1, n_Y-1)$ degrees of freedom. This distribution can be denoted $F(n_X-1, n_Y-1)$, and its quantiles are tabulated. Hence we can look up the 0.025 and 0.975 quantiles of this distribution (the F distribution is not symmetric), and hence define the critical region. Informally, if the test statistic value $q$ is very small or very large, then it is a surprising observation from the F distribution, and hence we reject the hypothesis of equal variances.

HYPOTHESIS TESTING SUMMARY

In general, to test a hypothesis $H_0$, we consider a statistic calculated from the sample data. We derive mathematically the probability distribution of the statistic, considered as a random variable, when the hypothesis $H_0$ is true, and compare the actual observed value of the statistic computed from the data sample with the hypothetical probability distribution. We ask the question "Is the value a likely observation from this probability distribution?" If the answer is "No", then we reject the hypothesis, otherwise we accept it.

7.4 POINT ESTIMATION

Definition 7.4.1 Let $X_1,\dots,X_n$ be a random sample from a distribution with mass/density function $f_X$ that depends on a (possibly vector) parameter $\theta$. Then $f_{X_1}(x_1) = f_X(x_1;\theta)$, so that
$$f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \prod_{i=1}^n f_X(x_i;\theta).$$
A statistic $T = t(X_1,\dots,X_n)$ that is used to represent or estimate a function $\tau(\theta)$ of $\theta$ based on an observed sample $x_1,\dots,x_n$ of the random variables is an estimator, and $t = t(x_1,\dots,x_n)$ is an estimate, $\hat{\tau}(\theta)$, of $\tau(\theta)$. The estimator $T = t(X_1,\dots,X_n)$ is said to be unbiased if $E(T) = \tau(\theta)$; otherwise $T$ is biased.

7.4.1 ESTIMATION TECHNIQUES I: METHOD OF MOMENTS

Suppose that $X_1,\dots,X_n$ is a random sample from a probability distribution with mass/density function $f_X$ that depends on a vector parameter $\theta$ of dimension $k$, and suppose that a sample $x_1,\dots,x_n$ has been observed. Let the $r$th moment of $f_X$ be denoted $\mu_r$, and let the $r$th sample moment, denoted $m_r$, be defined for $r = 1,2,\dots$ by
$$m_r = \frac{1}{n}\sum_{i=1}^n x_i^r.$$
Then $m_r$ is an estimate of $\mu_r$, and
$$M_r = \frac{1}{n}\sum_{i=1}^n X_i^r$$
is an estimator of $\mu_r$.

PROCEDURE: The method of moments technique of estimation involves matching the theoretical moments $\mu_r \equiv \mu_r(\theta)$ to the sample moments $m_r$, $r = 1,2,\dots,l$, for suitable $l$, and solving for $\theta$. In most situations taking $l = k$, the dimension of $\theta$, suffices: we obtain $k$ equations in the $k$ elements of the vector $\theta$, which may be solved simultaneously to find the parameter estimates. We may, however, need $l > k$. Intuitively, and recalling the Weak Law of Large Numbers, it is reasonable to suppose that there is a close relationship between the theoretical properties of a probability distribution and estimates derived from a large sample. For example, we know that, for large $n$, the sample mean converges in probability to the theoretical expectation.
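A worked sketch of the method of moments (my own illustration) for the $Ga(\alpha,\beta)$ family, using $\mu_1 = \alpha/\beta$ and $\mu_2 = \alpha(\alpha+1)/\beta^2$, so that matching $m_1$ and $m_2$ gives $\hat{\beta} = m_1/(m_2 - m_1^2)$ and $\hat{\alpha} = m_1\hat{\beta}$:

```python
import numpy as np

# Method of moments for Ga(alpha, beta): match the first two theoretical
# moments to the first two sample moments and solve for (alpha, beta).
rng = np.random.default_rng(11)
alpha_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1 / beta_true, size=100_000)

m1 = np.mean(x)        # first sample moment
m2 = np.mean(x**2)     # second sample moment

beta_hat = m1 / (m2 - m1**2)   # note m2 - m1^2 is the sample variance
alpha_hat = m1 * beta_hat

print(alpha_hat, beta_hat)     # should be close to (3.0, 2.0)
```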

7.4.2 ESTIMATION TECHNIQUES II: MAXIMUM LIKELIHOOD

Definition 7.4.2 Let random variables $X_1,\dots,X_n$ have joint mass or density function, denoted $f_{X_1,\dots,X_n}$, that depends on a vector parameter $\theta = (\theta_1,\dots,\theta_k)$. Then the joint mass/density function, considered as a function of $\theta$ for the (fixed) observed values $x_1,\dots,x_n$ of the variables, is the likelihood function, $L(\theta)$:
$$L(\theta) = f_{X_1,\dots,X_n}(x_1,\dots,x_n;\theta).$$
If $X_1,\dots,X_n$ represents a random sample from mass/density function $f_X$, then
$$L(\theta) = \prod_{i=1}^n f_X(x_i;\theta).$$

Definition 7.4.3 Let $L(\theta)$ be the likelihood function derived from the joint mass/density function of random variables $X_1,\dots,X_n$, where $\theta \in \Theta \subseteq \mathbb{R}^k$, say, and $\Theta$ is termed the parameter space. Then, for a fixed set of observed values $x_1,\dots,x_n$ of the variables, the estimate of $\theta$ termed the maximum likelihood estimate (MLE) of $\theta$, $\hat{\theta}$, is defined by
$$\hat{\theta} = \arg\max_{\theta\in\Theta} L(\theta).$$
That is, the maximum likelihood estimate is the value of $\theta$ for which $L(\theta)$ is maximized in the parameter space $\Theta$.

DISCUSSION: The method of estimation involves finding the value of $\theta$ for which $L(\theta)$ is maximized. This is generally done by setting the first partial derivatives of $L(\theta)$ with respect to $\theta_j$ equal to zero, for $j = 1,\dots,k$, and solving the resulting $k$ simultaneous equations. But we must be alert to cases where the likelihood function $L(\theta)$ is not differentiable, or where the maximum occurs on the boundary of $\Theta$! Typically, it is easier to obtain the MLE by maximising the (natural) logarithm of $L(\theta)$: we maximise $l(\theta) = \log L(\theta)$, the log-likelihood.

THE FOUR STEP ESTIMATION PROCEDURE: Suppose a sample $x_1,\dots,x_n$ has been obtained from a probability model specified by mass or density function $f_X(x;\theta)$ depending on parameter(s) $\theta$ lying in parameter space $\Theta$. The maximum likelihood estimate is produced as follows:

1. Write down the likelihood function, $L(\theta)$.

2. Take the natural log of the likelihood, and collect terms involving $\theta$.

3. Find the value of $\theta \in \Theta$, $\hat{\theta}$, for which $\log L(\theta)$ is maximized, for example by differentiation. Note that, if the parameter space $\Theta$ is a bounded interval, then the maximum likelihood estimate may lie on the boundary of $\Theta$. If the parameter is a $k$-vector, the maximization involves evaluation of partial derivatives.

4. Check that the estimate $\hat{\theta}$ obtained in STEP 3 truly corresponds to a maximum in the (log-)likelihood function by inspecting the second derivative of $\log L(\theta)$ with respect to $\theta$. In the single parameter case, if the second derivative of the log-likelihood is negative at $\theta = \hat{\theta}$, then $\hat{\theta}$ is confirmed as the MLE of $\theta$ (other techniques may be used to verify that the likelihood is maximized at $\hat{\theta}$).
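A sketch (my own illustration) of the four-step procedure for the $Exp(\lambda)$ model: $L(\lambda) = \prod \lambda e^{-\lambda x_i}$, so $l(\lambda) = n\log\lambda - \lambda\sum x_i$; $l'(\lambda) = n/\lambda - \sum x_i = 0$ gives $\hat{\lambda} = 1/\bar{x}$, and $l''(\lambda) = -n/\lambda^2 < 0$ confirms a maximum:

```python
import numpy as np

# Closed-form MLE for Exp(lam), cross-checked by maximizing the
# log-likelihood numerically over a grid.
rng = np.random.default_rng(12)
x = rng.exponential(scale=1 / 2.5, size=50_000)   # assumed true lam = 2.5

lam_hat = 1.0 / x.mean()                          # STEP 3: lam-hat = 1/xbar

def loglik(lam):
    # STEP 2: l(lam) = n*log(lam) - lam*sum(x)
    return len(x) * np.log(lam) - lam * x.sum()

grid = np.linspace(0.5, 5.0, 1000)
print(lam_hat, grid[np.argmax(loglik(grid))])     # both close to 2.5
```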

7.5 INTERVAL ESTIMATION

The techniques of the previous section provide just a single point estimate of the value of an unknown parameter. Instead, we can attempt to provide a set or interval of values which expresses our uncertainty over the unknown value.

Definition 7.5.1 Let $\mathbf{X} = (X_1,\dots,X_n)$ be a random sample from a distribution depending on an unknown scalar parameter $\theta$. Let $T_1 = l_1(X_1,\dots,X_n)$ and $T_2 = l_2(X_1,\dots,X_n)$ be two statistics satisfying $T_1 \le T_2$ for which
$$P(T_1 < \theta < T_2) = 1 - \alpha,$$
where $P$ denotes probability when $\mathbf{X}$ has distribution specified by parameter value $\theta$, whatever the true value of $\theta$, and where $1-\alpha$ does not depend on $\theta$. Then the random interval $(T_1,T_2)$ is called a $1-\alpha$ confidence interval for $\theta$, and $T_1$ and $T_2$ are called the lower and upper confidence limits, respectively. Given data, $\mathbf{x} = (x_1,\dots,x_n)$, the realised value of $\mathbf{X}$, we calculate the interval $(t_1,t_2)$, where $t_1 = l_1(x_1,\dots,x_n)$ and $t_2 = l_2(x_1,\dots,x_n)$.

NOTE: $\theta$ here is fixed (it has some true, fixed but unknown value in $\Theta$), and it is $T_1, T_2$ that are random. The calculated interval $(t_1,t_2)$ either does, or does not, contain the true value of $\theta$. Under repeated sampling of $\mathbf{X}$, the random interval $(T_1,T_2)$ contains the true value a proportion $1-\alpha$ of the time. Typically, we will take $\alpha = 0.05$ or $\alpha = 0.01$, corresponding to a 95% or 99% confidence interval.

If $\theta$ is a vector, then we use a confidence set, such as a sphere or ellipse, instead of an interval. So, $C(\mathbf{X})$ is a $1-\alpha$ confidence set for $\theta$ if $P(\theta \in C(\mathbf{X})) = 1-\alpha$, for all possible $\theta$.

The key problem is to develop methods for constructing a confidence interval. There are two general procedures.

7.5.1 PIVOTAL QUANTITY

Definition 7.5.2 Let $\mathbf{X} = (X_1,\dots,X_n)$ be a random sample from a distribution specified by a (scalar) parameter $\theta$. Let $Q = q(\mathbf{X};\theta)$ be a function of $\mathbf{X}$ and $\theta$. If $Q$ has a distribution that does not depend on $\theta$, then $Q$ is defined to be a pivotal quantity.

If $Q = q(\mathbf{X};\theta)$ is a pivotal quantity and has a probability density function, then for any fixed $0 < \alpha < 1$ there will exist $q_1$ and $q_2$, depending on $\alpha$, such that $P(q_1 < Q < q_2) = 1-\alpha$. Then, if for each possible sample value $\mathbf{x} = (x_1,\dots,x_n)$,
$$q_1 < q(x_1,\dots,x_n;\theta) < q_2 \quad \text{if and only if} \quad l_1(x_1,\dots,x_n) < \theta < l_2(x_1,\dots,x_n),$$
for functions $l_1$ and $l_2$ not depending on $\theta$, then $(T_1,T_2)$ is a $1-\alpha$ confidence interval for $\theta$, where $T_i = l_i(X_1,\dots,X_n)$, $i = 1,2$.

7.5.2 INVERTING A TEST STATISTIC

A second method utilises a correspondence between hypothesis testing and interval estimation.

Definition 7.5.3 For each possible value $\theta_0$, let $A(\theta_0)$ be the acceptance region of a test of $H_0: \theta = \theta_0$ of significance level $\alpha$, so that $A(\theta_0)$ is the set of data values $\mathbf{x}$ such that $H_0$ is accepted in a test of significance level $\alpha$. For each $\mathbf{x}$, define a set $C(\mathbf{x})$ by
$$C(\mathbf{x}) = \{\theta_0 : \mathbf{x} \in A(\theta_0)\}.$$
Then the random set $C(\mathbf{X})$ is a $1-\alpha$ confidence set.

Example. Let $\mathbf{X} = (X_1,\dots,X_n)$, with the $X_i$ IID $N(\mu,\sigma^2)$, with $\sigma^2$ known. Both the above procedures yield a $1-\alpha$ confidence interval for $\mu$ of the form
$$(\bar{X} - z_{\alpha/2}\sigma/\sqrt{n},\ \bar{X} + z_{\alpha/2}\sigma/\sqrt{n}),$$
where $\Phi(z_{\alpha/2}) = 1 - \alpha/2$.

We know that
$$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1),$$
and so is a pivotal quantity. We have
$$P\left(-z_{\alpha/2} < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1-\alpha.$$
Rearranging gives
$$P\left(\bar{X} - z_{\alpha/2}\sigma/\sqrt{n} < \mu < \bar{X} + z_{\alpha/2}\sigma/\sqrt{n}\right) = 1-\alpha,$$
which defines a $1-\alpha$ confidence interval.

If we test $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$, a reasonable test has rejection region of the form $\{\mathbf{x} : |\bar{x} - \mu_0| > z_{\alpha/2}\sigma/\sqrt{n}\}$. So, $H_0$ is accepted for sample points with $|\bar{x} - \mu_0| < z_{\alpha/2}\sigma/\sqrt{n}$, or,
$$\bar{x} - z_{\alpha/2}\sigma/\sqrt{n} < \mu_0 < \bar{x} + z_{\alpha/2}\sigma/\sqrt{n}.$$
The test is constructed to have size (significance level) $\alpha$, so $P(H_0 \text{ is accepted} \mid \mu = \mu_0) = 1-\alpha$. So we can write
$$P\left(\bar{X} - z_{\alpha/2}\sigma/\sqrt{n} < \mu_0 < \bar{X} + z_{\alpha/2}\sigma/\sqrt{n} \;\Big|\; \mu = \mu_0\right) = 1-\alpha.$$
This is true for every $\mu_0$, so
$$P\left(\bar{X} - z_{\alpha/2}\sigma/\sqrt{n} < \mu < \bar{X} + z_{\alpha/2}\sigma/\sqrt{n}\right) = 1-\alpha$$
is true for every $\mu$, so the confidence interval obtained by inverting the test is the same as that derived by the pivotal quantity method.
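The repeated-sampling interpretation of this interval is easy to verify numerically; a closing sketch of my own, with assumed parameter values:

```python
import numpy as np

# Under repeated sampling, the random interval
# (Xbar - 1.96*sigma/sqrt(n), Xbar + 1.96*sigma/sqrt(n))
# contains the true mu in about 95% of samples.
rng = np.random.default_rng(13)
mu, sigma, n, reps = 10.0, 3.0, 25, 100_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)
covered = (xbar - half < mu) & (mu < xbar + half)
print(covered.mean())   # approximately 0.95
```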