Lecture 21. The Multivariate Normal Distribution




21.1 Definitions and Comments

The joint moment-generating function of X_1,...,X_n [also called the moment-generating function of the random vector (X_1,...,X_n)] is defined by

M(t_1,...,t_n) = E[exp(t_1X_1 + ··· + t_nX_n)].

Just as in the one-dimensional case, the moment-generating function determines the density uniquely. The random variables X_1,...,X_n are said to have the multivariate normal distribution, or to be jointly Gaussian (we also say that the random vector (X_1,...,X_n) is Gaussian), if

M(t_1,...,t_n) = exp(t_1µ_1 + ··· + t_nµ_n) exp( (1/2) Σ_{i,j=1}^n t_i a_{ij} t_j )

where the t_i and µ_j are arbitrary real numbers, and the matrix A = [a_{ij}] is symmetric and positive definite.

Before we do anything else, let us indicate the notational scheme we will be using. Vectors will be written with an underbar, and are assumed to be column vectors unless otherwise specified. If t is a column vector with components t_1,...,t_n, then to save space we write t = (t_1,...,t_n)'. The row vector with these components is the transpose of t, written t'. The moment-generating function of jointly Gaussian random variables then has the form

M(t_1,...,t_n) = exp(t'µ) exp( (1/2) t'At ).

We can describe Gaussian random vectors much more concretely.

21.2 Theorem

Jointly Gaussian random variables arise from nonsingular linear transformations on independent normal random variables.

Proof. Let X_1,...,X_n be independent, with X_i normal (0, λ_i), and let X = (X_1,...,X_n)'. Let Y = BX + µ, where B is nonsingular. Then Y is Gaussian, as can be seen by computing the moment-generating function of Y:

M_Y(t) = E[exp(t'Y)] = E[exp(t'BX)] exp(t'µ).

But

E[exp(u'X)] = Π_{i=1}^n E[exp(u_iX_i)] = exp( (1/2) Σ_{i=1}^n λ_i u_i² ) = exp( (1/2) u'Du )

where D is a diagonal matrix with the λ_i down the main diagonal. Set u = B't, so u' = t'B; then

M_Y(t) = exp(t'µ) exp( (1/2) t'BDB't )

and BDB' is symmetric since D is symmetric. Since t'BDB't = u'Du, which is greater than 0 except when u = 0 (equivalently when t = 0, because B is nonsingular), BDB' is positive definite, and consequently Y is Gaussian.

Conversely, suppose that the moment-generating function of Y is exp(t'µ) exp[(1/2)t'At], where A is symmetric and positive definite. Let L be an orthogonal matrix such that L'AL = D, where D is the diagonal matrix of eigenvalues of A. Set X = L'(Y − µ), so that Y = µ + LX. The moment-generating function of X is

E[exp(t'X)] = exp(−t'L'µ) E[exp(t'L'Y)].

The last term is the moment-generating function of Y with t' replaced by t'L', or equivalently t replaced by Lt. Thus the moment-generating function of X becomes

exp(−t'L'µ) exp(t'L'µ) exp( (1/2) t'L'ALt ).

This reduces to

exp( (1/2) t'Dt ) = exp( (1/2) Σ_{i=1}^n λ_i t_i² ).

Therefore the X_i are independent, with X_i normal (0, λ_i).

21.3 A Geometric Interpretation

Assume for simplicity that the random variables X_i have zero mean. If E(U) = E(V) = 0, then the covariance of U and V is E(UV), which can be regarded as an inner product. Then Y_1 − µ_1,...,Y_n − µ_n span an n-dimensional space, and X_1,...,X_n is an orthogonal basis for that space. We will see later in the lecture that orthogonality is equivalent to independence. (Orthogonality means that the X_i are uncorrelated, i.e., E(X_iX_j) = 0 for i ≠ j.)

21.4 Theorem

Let Y = µ + LX as in the proof of (21.2), and let A be the symmetric, positive definite matrix appearing in the moment-generating function of the Gaussian random vector Y. Then E(Y_i) = µ_i for all i, and furthermore A is the covariance matrix of the Y_i; in other words, a_{ij} = Cov(Y_i, Y_j) (and a_{ii} = Cov(Y_i, Y_i) = Var Y_i). It follows that the means of the Y_i and their covariance matrix determine the moment-generating function, and therefore the density.

Proof. Since the X_i have zero mean, we have E(Y_i) = µ_i. Let K be the covariance matrix of the Y_i. Then K can be written in the following peculiar way:

K = E[ (Y_1 − µ_1, ..., Y_n − µ_n)' (Y_1 − µ_1, ..., Y_n − µ_n) ].

Note that if a matrix M is n by 1 and a matrix N is 1 by n, then MN is n by n. In this case, the ij entry is E[(Y_i − µ_i)(Y_j − µ_j)] = Cov(Y_i, Y_j). Thus

K = E[(Y − µ)(Y − µ)'] = E(LXX'L') = L E(XX') L'

since expectation is linear. [For example, E(MX) = M E(X) because E(Σ_j m_{ij}X_j) = Σ_j m_{ij}E(X_j).] But E(XX') is the covariance matrix of the X_i, which is D. Therefore K = LDL' = A (because L'AL = D).

21.5 Finding the Density

From Y = µ + LX we can calculate the density of Y. The Jacobian of the transformation from X to Y is det L = ±1, and

f_X(x_1,...,x_n) = 1/( (√(2π))^n √(λ_1 ··· λ_n) ) exp( −(1/2) Σ_{i=1}^n x_i²/λ_i ).

We have λ_1 ··· λ_n = det D = det K, because det L = det L' = ±1. Thus

f_X(x_1,...,x_n) = 1/( (2π)^{n/2} (det K)^{1/2} ) exp( −(1/2) x'D^{−1}x ).

But y = µ + Lx, x = L'(y − µ), x'D^{−1}x = (y − µ)'LD^{−1}L'(y − µ), and [see the end of (21.4)] K = LDL', K^{−1} = LD^{−1}L'. The density of Y is

f_Y(y_1,...,y_n) = 1/( (2π)^{n/2} (det K)^{1/2} ) exp[ −(1/2)(y − µ)'K^{−1}(y − µ) ].

21.6 Individually Gaussian Versus Jointly Gaussian

If X_1,...,X_n are jointly Gaussian, then each X_i is normally distributed (see Problem 4), but not conversely. For example, let X be normal (0,1) and flip an unbiased coin. If the coin shows heads, set Y = X, and if tails, set Y = −X. Then Y is also normal (0,1), since

P{Y ≤ y} = (1/2)P{X ≤ y} + (1/2)P{−X ≤ y} = P{X ≤ y}

because −X is also normal (0,1). Thus F_X = F_Y. But with probability 1/2, X + Y = 2X, and with probability 1/2, X + Y = 0. Therefore P{X + Y = 0} = 1/2. If X and Y were jointly Gaussian, then X + Y would be normal (Problem 4). We conclude that X and Y are individually Gaussian but not jointly Gaussian.
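Theorem 21.2, together with the identity K = LDL' = A, can be checked numerically. The sketch below uses our own illustrative numbers (lam, B, mu are not from the lecture): it builds Y = BX + µ from independent normals X_i ~ normal(0, λ_i) and compares the sample covariance of Y with A = BDB'.

```python
import random, math

# Sketch (illustrative numbers, not from the text): simulate Y = B X + mu
# from independent X_i ~ normal(0, lam_i) and compare the sample covariance
# of Y with the theoretical covariance A = B D B', D = diag(lam).
random.seed(0)
lam = [2.0, 0.5]                       # variances of the independent X_i
B = [[1.0, 1.0], [0.0, 1.0]]           # a nonsingular mixing matrix
mu = [3.0, -1.0]

# theoretical covariance A = B D B'
A = [[sum(B[i][k] * lam[k] * B[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]

N = 50_000
ys = []
for _ in range(N):
    x = [random.gauss(0.0, math.sqrt(v)) for v in lam]
    ys.append([mu[i] + sum(B[i][k] * x[k] for k in range(2)) for i in range(2)])

mean = [sum(y[i] for y in ys) / N for i in range(2)]
cov = [[sum((y[i] - mean[i]) * (y[j] - mean[j]) for y in ys) / N
        for j in range(2)] for i in range(2)]
```

With these numbers A works out to [[2.5, 0.5], [0.5, 0.5]], and the empirical mean and covariance should agree with µ and A to a couple of decimal places.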

21.7 Theorem

If X_1,...,X_n are jointly Gaussian and uncorrelated (Cov(X_i, X_j) = 0 for all i ≠ j), then the X_i are independent.

Proof. The moment-generating function of X = (X_1,...,X_n)' is

M_X(t) = exp(t'µ) exp( (1/2) t'Kt )

where K is a diagonal matrix with entries σ_1², σ_2²,...,σ_n² down the main diagonal, and 0's elsewhere. Thus

M_X(t) = Π_{i=1}^n exp(t_iµ_i) exp( (1/2) σ_i² t_i² )

which is the joint moment-generating function of independent random variables X_1,...,X_n, where X_i is normal (µ_i, σ_i²).

21.8 A Conditional Density

Let X_1,...,X_n be jointly Gaussian. We find the conditional density of X_n given X_1,...,X_{n−1}:

f(x_n | x_1,...,x_{n−1}) = f(x_1,...,x_n) / f(x_1,...,x_{n−1})

with

f(x_1,...,x_n) = (2π)^{−n/2} (det K)^{−1/2} exp[ −(1/2) Σ_{i,j=1}^n y_i q_{ij} y_j ]

where Q = K^{−1} = [q_{ij}] and y_i = x_i − µ_i. Also,

f(x_1,...,x_{n−1}) = ∫_{−∞}^{∞} f(x_1,...,x_{n−1}, x_n) dx_n = B(y_1,...,y_{n−1}).

Now

Σ_{i,j=1}^n y_i q_{ij} y_j = Σ_{i,j=1}^{n−1} y_i q_{ij} y_j + y_n Σ_{j=1}^{n−1} q_{nj} y_j + y_n Σ_{i=1}^{n−1} q_{in} y_i + q_{nn} y_n².

Thus the conditional density has the form

[ A(y_1,...,y_{n−1}) / B(y_1,...,y_{n−1}) ] exp[ −(C y_n² + D(y_1,...,y_{n−1}) y_n) ]

with C = (1/2)q_{nn} and D = Σ_{j=1}^{n−1} q_{nj} y_j = Σ_{i=1}^{n−1} q_{in} y_i, since Q = K^{−1} is symmetric. The conditional density may now be expressed as

(A/B) exp( D²/(4C) ) exp[ −C ( y_n + D/(2C) )² ].

We conclude that given X_1,...,X_{n−1}, X_n is normal. The conditional variance of X_n (the same as the conditional variance of Y_n = X_n − µ_n) is 1/q_{nn}: matching the exponent −C(y_n + D/(2C))² with the normal form −(y_n − mean)²/(2σ²) gives 1/(2σ²) = C, σ² = 1/(2C) = 1/q_{nn}. Thus

Var(X_n | X_1,...,X_{n−1}) = 1/q_{nn}

and the conditional mean of Y_n is

−D/(2C) = −(1/q_{nn}) Σ_{j=1}^{n−1} q_{nj} Y_j

so the conditional mean of X_n is

E(X_n | X_1,...,X_{n−1}) = µ_n − (1/q_{nn}) Σ_{j=1}^{n−1} q_{nj}(X_j − µ_j).

Recall from Lecture 8 that E(Y | X) is the best estimate of Y based on X, in the sense that the mean square error is minimized. In the jointly Gaussian case, the best estimate of X_n based on X_1,...,X_{n−1} is linear, and it follows that the best linear estimate is in fact the best overall estimate. This has important practical applications, since linear systems are usually much easier than nonlinear systems to implement and analyze.

Problems

1. Let K be the covariance matrix of arbitrary random variables X_1,...,X_n. Assume that K is nonsingular to avoid degenerate cases. Show that K is symmetric and positive definite. What can you conclude if K is singular?

2. If X is a Gaussian n-vector and Y = AX with A nonsingular, show that Y is Gaussian.

3. If X_1,...,X_n are jointly Gaussian, show that X_1,...,X_m are jointly Gaussian for m ≤ n.

4. If X_1,...,X_n are jointly Gaussian, show that c_1X_1 + ··· + c_nX_n is a normal random variable (assuming it is nondegenerate, i.e., not identically constant).
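For n = 2 the formulas of 21.8 become completely explicit. The sketch below (our own numbers) inverts a 2-by-2 covariance matrix K by hand, reads off the conditional mean and variance from Q = K^{−1}, and checks them against the familiar bivariate expressions E(X_2 | X_1 = x_1) = µ_2 + ρ(σ_2/σ_1)(x_1 − µ_1) and Var(X_2 | X_1) = σ_2²(1 − ρ²).

```python
# A 2-by-2 instance of 21.8 (our own illustrative numbers): with
# Q = K^{-1} = [q_ij], we have Var(X_2 | X_1) = 1/q_22 and
# E(X_2 | X_1 = x_1) = mu_2 - (q_21/q_22)(x_1 - mu_1).
s1, s2, rho = 1.0, 2.0, 0.5
mu1, mu2 = 3.0, -1.0
K = [[s1 * s1, rho * s1 * s2],
     [rho * s1 * s2, s2 * s2]]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
Q = [[ K[1][1] / det, -K[0][1] / det],     # explicit 2x2 inverse
     [-K[1][0] / det,  K[0][0] / det]]

x1 = 2.0
cond_var = 1.0 / Q[1][1]
cond_mean = mu2 - (Q[1][0] / Q[1][1]) * (x1 - mu1)
```

Here cond_var = 3.0 = σ_2²(1 − ρ²) and cond_mean = −2.0 = µ_2 + ρ(σ_2/σ_1)(x_1 − µ_1), as expected.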

Lecture 22. The Bivariate Normal Distribution

22.1 Formulas

The general formula for the n-dimensional normal density is

f_X(x_1,...,x_n) = 1/( (2π)^{n/2} (det K)^{1/2} ) exp[ −(1/2)(x − µ)'K^{−1}(x − µ) ]

where E(X) = µ and K is the covariance matrix of X. We specialize to the case n = 2:

K = [ σ_1²      σ_12 ]   =   [ σ_1²      ρσ_1σ_2 ]
    [ σ_12      σ_2² ]       [ ρσ_1σ_2   σ_2²    ]

where σ_12 = Cov(X_1, X_2) = ρσ_1σ_2;

K^{−1} = 1/( σ_1²σ_2²(1 − ρ²) ) [ σ_2²       −ρσ_1σ_2 ]   =   1/(1 − ρ²) [ 1/σ_1²        −ρ/(σ_1σ_2) ]
                                [ −ρσ_1σ_2   σ_1²     ]                  [ −ρ/(σ_1σ_2)   1/σ_2²      ]

Thus the joint density of X_1 and X_2 is

f(x_1, x_2) = 1/( 2πσ_1σ_2 √(1 − ρ²) ) exp{ −1/(2(1 − ρ²)) [ ((x_1 − µ_1)/σ_1)² − 2ρ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ] }.

The moment-generating function of X is

M_X(t_1, t_2) = exp(t'µ) exp( (1/2) t'Kt ) = exp[ t_1µ_1 + t_2µ_2 + (1/2)( σ_1²t_1² + 2ρσ_1σ_2t_1t_2 + σ_2²t_2² ) ].

If X_1 and X_2 are jointly Gaussian and uncorrelated, then ρ = 0, so that f(x_1, x_2) is the product of a function g(x_1) of x_1 alone and a function h(x_2) of x_2 alone. It follows that X_1 and X_2 are independent. (We proved independence in the general n-dimensional case in the previous lecture.)

From the results at the end of the previous lecture, the conditional distribution of X_2 given X_1 is normal, with

E(X_2 | X_1 = x_1) = µ_2 − (q_21/q_22)(x_1 − µ_1)

where q_21 = −ρ/( (1 − ρ²)σ_1σ_2 ) and q_22 = 1/( (1 − ρ²)σ_2² ), hence q_21/q_22 = −ρσ_2/σ_1. Thus

E(X_2 | X_1 = x_1) = µ_2 + ρ(σ_2/σ_1)(x_1 − µ_1)

and

Var(X_2 | X_1 = x_1) = 1/q_22 = σ_2²(1 − ρ²).

For E(X_1 | X_2 = x_2) and Var(X_1 | X_2 = x_2), interchange µ_1 and µ_2, and interchange σ_1 and σ_2.
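A quick numerical sanity check on the density formula (our own sketch, with arbitrary test values): when ρ = 0 the bivariate density should factor into the product of the two normal marginals.

```python
import math

# Sketch (our helper names, arbitrary test values): the bivariate normal
# density of 22.1, and a check that with rho = 0 it factors into the
# product of the two univariate normal marginals.
def bvn_density(x1, x2, mu1, mu2, s1, s2, rho):
    z = (((x1 - mu1) / s1) ** 2
         - 2 * rho * ((x1 - mu1) / s1) * ((x2 - mu2) / s2)
         + ((x2 - mu2) / s2) ** 2)
    return math.exp(-z / (2 * (1 - rho ** 2))) / (
        2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2))

def normal_density(x, mu, s):
    return math.exp(-((x - mu) / s) ** 2 / 2) / (s * math.sqrt(2 * math.pi))

lhs = bvn_density(1.0, -0.5, 0.0, 0.0, 1.0, 2.0, 0.0)
rhs = normal_density(1.0, 0.0, 1.0) * normal_density(-0.5, 0.0, 2.0)
```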

22.2 Example

Let X be the height of the father and Y the height of the son in a sample of father-son pairs. Assume X and Y bivariate normal, as found by Karl Pearson around 1900. Assume E(X) = 68 (inches), E(Y) = 69, σ_X = σ_Y = 2, ρ = .5. (We expect ρ to be positive because, on the average, the taller the father, the taller the son.) Given X = 80 (6 feet 8 inches), Y is normal with mean

µ_Y + ρ(σ_Y/σ_X)(x − µ_X) = 69 + .5(80 − 68) = 75,

which is 6 feet 3 inches. The variance of Y given X = 80 is

σ_Y²(1 − ρ²) = 4(3/4) = 3.

Thus the son will tend to be of above average height, but not as tall as the father. This phenomenon is often called regression, and the line y = µ_Y + (ρσ_Y/σ_X)(x − µ_X) is called the line of regression, or the regression line.

Problems

1. Let X and Y have the bivariate normal distribution. The following facts are known: the values of µ_X and σ_X, and that the best estimate of Y based on X, i.e., the estimate that minimizes the mean square error, is given by 3X + 7. The minimum mean square error is 8. Find µ_Y, σ_Y and the correlation coefficient ρ between X and Y (in terms of µ_X and σ_X).

2. Show that the bivariate normal density belongs to the exponential class, and find the corresponding complete sufficient statistic.
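The computation in the father-son example is mechanical enough to script. A minimal sketch (the function names are ours):

```python
# Conditional law of Y given X = x for a bivariate normal (as in 22.1):
# mean mu_Y + rho*(sig_Y/sig_X)*(x - mu_X), variance sig_Y^2*(1 - rho^2).
# Helper names are ours; the numbers below are Pearson's from 22.2.
def conditional_mean(x, mu_x, mu_y, sig_x, sig_y, rho):
    return mu_y + rho * (sig_y / sig_x) * (x - mu_x)

def conditional_var(sig_y, rho):
    return sig_y ** 2 * (1.0 - rho ** 2)

m = conditional_mean(80, 68, 69, 2, 2, 0.5)   # father is 80 inches tall
v = conditional_var(2, 0.5)
```

This reproduces the values in the example: m = 75 and v = 3.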

Lecture 23. The Cramér-Rao Inequality

23.1 A Strange Random Variable

We are given a density f_θ(x), −∞ < x < ∞, a < θ < b. We have found maximum likelihood estimates by computing (∂/∂θ) ln f_θ(x). If we replace x by X, we have a random variable. To see what is going on, let's look at a discrete example. If X takes on values x_1, x_2, x_3, x_4 with p(x_1) = .5, p(x_2) = p(x_3) = .2, p(x_4) = .1, then p(X) is a random variable with the following distribution:

P{p(X) = .5} = .5,  P{p(X) = .2} = .4,  P{p(X) = .1} = .1.

For example, if X = x_2 then p(X) = p(x_2) = .2, and if X = x_3 then p(X) = p(x_3) = .2. The total probability that p(X) = .2 is .4.

The continuous case is, at first sight, easier to handle. If X has density f and X = x, then f(X) = f(x). But what is the density of f(X)? We will not need the result, but the question is interesting and is considered in Problem 1.

The following two lemmas will be needed to prove the Cramér-Rao inequality, which can be used to compute uniformly minimum variance unbiased estimates. In the calculations to follow, we are going to assume that all differentiations under the integral sign are legal.

23.2 Lemma

E_θ[ (∂/∂θ) ln f_θ(X) ] = 0.

Proof. The expectation is

∫ [ (∂/∂θ) ln f_θ(x) ] f_θ(x) dx = ∫ [ (∂f_θ(x)/∂θ) / f_θ(x) ] f_θ(x) dx

which reduces to

(∂/∂θ) ∫ f_θ(x) dx = (∂/∂θ)(1) = 0.

23.3 Lemma

Let Y = g(X) and assume E_θ(Y) = k(θ). If k'(θ) = dk(θ)/dθ, then

k'(θ) = E_θ[ Y (∂/∂θ) ln f_θ(X) ].

Proof. We have

k'(θ) = (∂/∂θ) E_θ[g(X)] = (∂/∂θ) ∫ g(x) f_θ(x) dx = ∫ g(x) (∂f_θ(x)/∂θ) dx

= ∫ g(x) (∂f_θ(x)/∂θ) (1/f_θ(x)) f_θ(x) dx = ∫ g(x) [ (∂/∂θ) ln f_θ(x) ] f_θ(x) dx = E_θ[ g(X) (∂/∂θ) ln f_θ(X) ] = E_θ[ Y (∂/∂θ) ln f_θ(X) ].

23.4 The Cramér-Rao Inequality

Under the assumptions of (23.3), we have

Var_θ Y ≥ [k'(θ)]² / E_θ[ ( (∂/∂θ) ln f_θ(X) )² ].

Proof. By the Cauchy-Schwarz inequality,

[Cov(V, W)]² = ( E[(V − µ_V)(W − µ_W)] )² ≤ Var V · Var W;

hence

[ Cov_θ( Y, (∂/∂θ) ln f_θ(X) ) ]² ≤ Var_θ Y · Var_θ[ (∂/∂θ) ln f_θ(X) ].

Since E_θ[ (∂/∂θ) ln f_θ(X) ] = 0 by (23.2), this becomes

( E_θ[ Y (∂/∂θ) ln f_θ(X) ] )² ≤ Var_θ Y · E_θ[ ( (∂/∂θ) ln f_θ(X) )² ].

By (23.3), the left side is [k'(θ)]², and the result follows.

23.5 A Special Case

Let X_1,...,X_n be iid, each with density f_θ(x), and take X = (X_1,...,X_n). Then f_θ(x_1,...,x_n) = Π_{i=1}^n f_θ(x_i), and by (23.2),

E_θ[ ( (∂/∂θ) ln f_θ(X) )² ] = Var_θ[ (∂/∂θ) ln f_θ(X) ] = Var_θ Σ_{i=1}^n (∂/∂θ) ln f_θ(X_i) = n Var_θ[ (∂/∂θ) ln f_θ(X_i) ] = n E_θ[ ( (∂/∂θ) ln f_θ(X_i) )² ].

23.6 Theorem

Let X_1,...,X_n be iid, each with density f_θ(x). If Y = g(X_1,...,X_n) is an unbiased estimate of θ, then

Var_θ Y ≥ 1 / ( n E_θ[ ( (∂/∂θ) ln f_θ(X_i) )² ] ).

Proof. Applying (23.5), we have a special case of the Cramér-Rao inequality (23.4) with k(θ) = θ, k'(θ) = 1.

The lower bound in (23.6) is 1/(nI(θ)), where

I(θ) = E_θ[ ( (∂/∂θ) ln f_θ(X_i) )² ]

is called the Fisher information. It follows from (23.6) that if Y is an unbiased estimate that meets the Cramér-Rao lower bound for all θ (an efficient estimate), then Y must be a UMVUE of θ.

23.7 A Computational Simplification

From (23.2) we have

∫ ( (∂/∂θ) ln f_θ(x) ) f_θ(x) dx = 0.

Differentiate again to obtain

∫ (∂² ln f_θ(x)/∂θ²) f_θ(x) dx + ∫ ( (∂/∂θ) ln f_θ(x) ) (∂f_θ(x)/∂θ) dx = 0.

But ∂f_θ(x)/∂θ = ( (∂/∂θ) ln f_θ(x) ) f_θ(x), so we have

∫ (∂² ln f_θ(x)/∂θ²) f_θ(x) dx + ∫ ( (∂/∂θ) ln f_θ(x) )² f_θ(x) dx = 0.

Therefore

I(θ) = E_θ[ ( (∂/∂θ) ln f_θ(X_i) )² ] = −E_θ[ ∂² ln f_θ(X_i)/∂θ² ].

Problems

1. If X is a random variable with density f(x), explain how to find the distribution of the random variable f(X).

2. Use the Cramér-Rao inequality to show that the sample mean is a UMVUE of the true mean in the Bernoulli, normal (with σ known) and Poisson cases.
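The identity in 23.7 is easy to verify on a concrete family. The sketch below (helper names are ours) computes both forms of I(p) for the Bernoulli distribution, where f_p(x) = p^x (1 − p)^{1−x} for x in {0, 1}; both forms equal 1/(p(1 − p)).

```python
# Checking 23.7 on the Bernoulli family (a sketch; helper names are ours):
# f_p(x) = p^x (1-p)^(1-x), x in {0,1}. The score is
# d/dp ln f_p(x) = x/p - (1-x)/(1-p), and
# d^2/dp^2 ln f_p(x) = -x/p^2 - (1-x)/(1-p)^2.
def fisher_info_bernoulli(p):
    def score(x):
        return x / p - (1 - x) / (1 - p)
    def curvature(x):
        return -x / p ** 2 - (1 - x) / (1 - p) ** 2
    # expectations over x in {0,1}, with P{X = 1} = p
    i_square = (1 - p) * score(0) ** 2 + p * score(1) ** 2
    i_curv = -((1 - p) * curvature(0) + p * curvature(1))
    return i_square, i_curv

s, c = fisher_info_bernoulli(0.3)   # both should equal 1/(p(1-p))
```

Since I(p) = 1/(p(1 − p)), the Cramér-Rao bound for an unbiased estimate of p is p(1 − p)/n, which is exactly the variance of the sample mean (compare Problem 2).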

Lecture 24. Nonparametric Statistics

We wish to make a statistical inference about a random variable X even though we know nothing at all about its underlying distribution.

24.1 Percentiles

Assume F continuous and strictly increasing. If 0 < p < 1, then the equation F(x) = p has a unique solution ξ_p, so that P{X ≤ ξ_p} = p. When p = 1/2, ξ_p is the median; when p = .3, ξ_p is the 30th percentile, and so on. Let X_1,...,X_n be iid, each with distribution function F, and let Y_1,...,Y_n be the order statistics. We will consider the problem of estimating ξ_p.

24.2 Point Estimates

On the average, np of the observations will be less than ξ_p. (We have n Bernoulli trials, with probability of success P{X_i < ξ_p} = F(ξ_p) = p.) It seems reasonable to use Y_k as an estimate of ξ_p, where k is approximately np. We can be a bit more precise. The random variables F(X_1),...,F(X_n) are iid, uniform on (0,1) [see (8.5)]. Thus F(Y_1),...,F(Y_n) are the order statistics from a uniform (0,1) sample. We know from Lecture 6 that the density of F(Y_k) is

n!/( (k−1)!(n−k)! ) x^{k−1}(1 − x)^{n−k},  0 < x < 1.

Therefore

E[F(Y_k)] = ∫_0^1 n!/( (k−1)!(n−k)! ) x^k (1 − x)^{n−k} dx = n!/( (k−1)!(n−k)! ) β(k+1, n−k+1).

Now β(k+1, n−k+1) = Γ(k+1)Γ(n−k+1)/Γ(n+2) = k!(n−k)!/(n+1)!, and consequently

E[F(Y_k)] = k/(n+1),  1 ≤ k ≤ n.

Define Y_0 = −∞ and Y_{n+1} = ∞, so that

E[F(Y_{k+1}) − F(Y_k)] = 1/(n+1),  0 ≤ k ≤ n.

(Note that when k = n, the expectation is 1 − [n/(n+1)] = 1/(n+1), as asserted.) The key point is that on the average, each interval [Y_k, Y_{k+1}] produces area 1/(n+1) under the density f of the X_i. This is true because

∫_{Y_k}^{Y_{k+1}} f(x) dx = F(Y_{k+1}) − F(Y_k)

and we have just seen that the expectation of this quantity is 1/(n+1), k = 0, 1,...,n. If we want to accumulate area p, set k/(n+1) = p, that is, k = (n+1)p.

Conclusion: If (n+1)p is an integer, estimate ξ_p by Y_{(n+1)p}. If (n+1)p is not an integer, we can use a weighted average. For example, if p = .6 and n = 13, then (n+1)p = 14(.6) = 8.4. Now if (n+1)p were 8, we would use Y_8, and if (n+1)p were 9, we would use Y_9. If (n+1)p = 8 + λ, we use (1−λ)Y_8 + λY_9. In the present case, λ = .4, so we use .6Y_8 + .4Y_9 = Y_8 + .4(Y_9 − Y_8).

24.3 Confidence Intervals

Select order statistics Y_i and Y_j, where i and j are (approximately) symmetrical about (n+1)p. Then P{Y_i < ξ_p < Y_j} is the probability that the number of observations less than ξ_p is at least i but less than j, i.e., between i and j−1, inclusive. The probability that exactly k observations will be less than ξ_p is C(n,k) p^k (1−p)^{n−k}, hence

P{Y_i < ξ_p < Y_j} = Σ_{k=i}^{j−1} C(n,k) p^k (1−p)^{n−k}.

Thus (Y_i, Y_j) is a confidence interval for ξ_p, and we can find the confidence level by evaluating the above sum, possibly with the aid of the normal approximation to the binomial.

24.4 Hypothesis Testing

First let's look at a numerical example. The 30th percentile ξ_.3 will be less than 68 precisely when F(ξ_.3) < F(68), because F is continuous and strictly increasing. Therefore ξ_.3 < 68 iff F(68) > .3. Similarly, ξ_.3 > 68 iff F(68) < .3, and ξ_.3 = 68 iff F(68) = .3. In general,

ξ_{p_0} < ξ ⟺ F(ξ) > p_0,  ξ_{p_0} > ξ ⟺ F(ξ) < p_0,  and  ξ_{p_0} = ξ ⟺ F(ξ) = p_0.

In our numerical example, if F(68) were actually .4, then on the average, 40 percent of the observations will be 68 or less, as opposed to 30 percent if F(68) = .3. Thus a larger than expected number of observations less than or equal to 68 will tend to make us reject the hypothesis that the 30th percentile is exactly 68. In general, our problem will be

H_0: ξ_{p_0} = ξ (equivalently F(ξ) = p_0) versus H_1: ξ_{p_0} < ξ (equivalently F(ξ) > p_0)

where p_0 and ξ are specified. If Y is the number of observations less than or equal to ξ, we propose to reject H_0 if Y ≥ c. (If H_1 is ξ_{p_0} > ξ, i.e., F(ξ) < p_0, we reject if Y ≤ c.)

Note that Y is the number of nonpositive signs in the sequence X_1 − ξ,...,X_n − ξ, and for this reason, the terminology "sign test" is used.

Since we are trying to determine whether F(ξ) is equal to p_0 or greater than p_0, we may regard θ = F(ξ) as the unknown state of nature. The power function of the test is

K(θ) = P_θ{Y ≥ c} = Σ_{k=c}^n C(n,k) θ^k (1 − θ)^{n−k}

and in particular, the significance level (probability of a type 1 error) is α = K(p_0).

The above confidence interval estimates and the sign test are distribution free, that is, independent of the underlying distribution function F.

Problems are deferred to Lecture 25.
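The power function of the sign test is a straightforward binomial tail sum. A sketch (the numbers n = 10, c = 6 are ours, chosen for illustration, not from the text):

```python
from math import comb

# Power function of the sign test from 24.4 (our helper name):
# K(theta) = sum_{k=c}^{n} C(n,k) theta^k (1-theta)^(n-k), the probability
# of rejecting H_0 when the true state of nature is theta = F(xi).
def power(theta, n, c):
    return sum(comb(n, k) * theta ** k * (1 - theta) ** (n - k)
               for k in range(c, n + 1))

# At theta = p_0 the power equals the significance level alpha.
alpha = power(0.3, 10, 6)   # illustrative: n = 10, c = 6, p_0 = 0.3
```

With these numbers α is a bit under .05, so rejecting when Y ≥ 6 gives a test at roughly the .05 level.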

Lecture 25. The Wilcoxon Test

We will need two formulas:

Σ_{k=1}^n k² = n(n+1)(2n+1)/6,  Σ_{k=1}^n k³ = [ n(n+1)/2 ]².

For a derivation via the calculus of finite differences, see my on-line text A Course in Commutative Algebra, Section 5.

The hypothesis testing problem addressed by the Wilcoxon test is the same as that considered by the sign test, except that:

(1) We are restricted to testing the median ξ_.5.

(2) We assume that X_1,...,X_n are iid and the underlying density is symmetric about the median (so we are not quite nonparametric). There are many situations where we suspect an underlying normal distribution but are not sure. In such cases, the symmetry assumption may be reasonable.

(3) We use the magnitudes as well as the signs of the deviations X_i − ξ_.5, so the Wilcoxon test should be more accurate than the sign test.

25.1 How the Test Works

Suppose we are testing H_0: ξ_.5 = m versus H_1: ξ_.5 > m based on observations X_1,...,X_n. We rank the absolute values |X_i − m| from smallest to largest. For example, let n = 5 and

X_1 − m = 2.7, X_2 − m = −1.3, X_3 − m = 0.3, X_4 − m = −3.1, X_5 − m = 2.4.

Then

|X_3 − m| < |X_2 − m| < |X_5 − m| < |X_1 − m| < |X_4 − m|.

Let R_i be the rank of |X_i − m|, so that R_3 = 1, R_2 = 2, R_5 = 3, R_1 = 4, R_4 = 5. Let Z_i be the sign of X_i − m, so that Z_i = ±1. Then Z_3 = 1, Z_2 = −1, Z_5 = 1, Z_1 = 1, Z_4 = −1. The Wilcoxon statistic is

W = Σ_{i=1}^n Z_i R_i.

In this case, W = 1 − 2 + 3 + 4 − 5 = 1.

Because the density is symmetric about the median, if R_i is given then Z_i is still equally likely to be ±1, so (R_1,...,R_n) and (Z_1,...,Z_n) are independent. (Note that if R_j is given, the odds about Z_i (i ≠ j) are unaffected, since the observations X_1,...,X_n are independent.) Now the R_i are simply a permutation of (1, 2,...,n), so W is a sum of independent random variables V_i, where V_i = ±i with equal probability.
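The rank-and-sign computation of the n = 5 example can be sketched as follows (our helper name; ties in the absolute deviations are ignored in this toy version):

```python
# Wilcoxon statistic as in 25.1 (a sketch, our function name): rank the
# |X_i - m| from smallest to largest, attach the sign of X_i - m, and sum
# the signed ranks. Ties are not handled in this toy version.
def wilcoxon_w(devs):
    order = sorted(range(len(devs)), key=lambda i: abs(devs[i]))
    ranks = [0] * len(devs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return sum((1 if d > 0 else -1) * r for d, r in zip(devs, ranks))

# deviations X_i - m of the n = 5 example (digits as assumed here)
W = wilcoxon_w([2.7, -1.3, 0.3, -3.1, 2.4])
```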

25.2 Properties of the Wilcoxon Statistic

Under H_0, E(V_i) = 0 and Var V_i = E(V_i²) = i², so

E(W) = Σ_{i=1}^n E(V_i) = 0,  Var W = Σ_{i=1}^n i² = n(n+1)(2n+1)/6.

The V_i do not have the same distribution, but the central limit theorem still applies because Liapounov's condition is satisfied:

Σ_{i=1}^n E[ |V_i − µ_i|³ ] / ( Σ_{i=1}^n σ_i² )^{3/2} → 0 as n → ∞.

Now the V_i have mean µ_i = 0, so |V_i − µ_i|³ = |V_i|³ = i³ and σ_i² = Var V_i = i². Thus the Liapounov fraction is the sum of the first n cubes divided by the 3/2 power of the sum of the first n squares, which is

[ n²(n+1)²/4 ] / [ n(n+1)(2n+1)/6 ]^{3/2}.

For large n, the numerator is of the order of n⁴ and the denominator is of the order of (n³)^{3/2} = n^{9/2}. Therefore the fraction is of the order of 1/√n → 0 as n → ∞.

By the central limit theorem, [W − E(W)]/σ(W) is approximately normal (0,1) for large n, with E(W) = 0 and σ²(W) = n(n+1)(2n+1)/6. If the median is larger than its value m under H_0, we expect W to have a positive bias. Thus we reject H_0 if W ≥ c. (If H_1 were ξ_.5 < m, we would reject if W ≤ c.) The value of c is determined by our choice of the significance level α.

Problems

1. Suppose we are using a sign test with n observations to decide between the null hypothesis H_0: m = 40 and the alternative H_1: m > 40, where m is the median. We use the statistic Y = the number of observations that are less than or equal to 40, and we reject H_0 if and only if Y ≤ c. Find the power function K(p) in terms of c and p = F(40), and the probability α of a type 1 error for a given value of c.

2. Let m be the median of a random variable with density symmetric about m. Using the Wilcoxon test, we are testing H_0: m = 60 versus H_1: m > 60 based on n = 16 observations, which are as follows: 76.9, 58.3, 5., 58.8, 7.4, 69.8, 59.7, 6.7, 56.6, 74.5, 84.4, 65., 47.8, 77.8, 60., 60.5. Compute the Wilcoxon statistic and determine whether H_0 is rejected at the .05 significance level, i.e., the probability of a type 1 error is .05.

3. When n is small, the distribution of W can be found explicitly. Do it for n = 1, 2, 3.
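Returning to 25.2, the formulas E(W) = 0 and Var W = n(n+1)(2n+1)/6 can be confirmed by brute-force enumeration for small n (our own sketch):

```python
from itertools import product

# Brute-force check of E(W) = 0 and Var W = n(n+1)(2n+1)/6 from 25.2:
# W = V_1 + ... + V_n with independent V_i = +-i, each sign probability 1/2,
# so we enumerate all 2^n sign patterns with equal weight.
def moments(n):
    outcomes = [sum(s[i] * (i + 1) for i in range(n))
                for s in product([1, -1], repeat=n)]
    mean = sum(outcomes) / len(outcomes)
    var = sum(w * w for w in outcomes) / len(outcomes) - mean ** 2
    return mean, var

m5, v5 = moments(5)   # should give 0 and 5*6*11/6 = 55
```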