Chi-squared goodness-of-fit test.

Section 10. Chi-squared goodness-of-fit test.

Example. Let us start with a Matlab example. Let us generate a vector X of 100 i.i.d. uniform random variables on [0, 1]:

    X = rand(100,1);

The parameters (100, 1) here mean that we generate a 100 x 1 matrix of uniform random variables. Let us test if the vector X comes from the distribution U[0, 1] using the $\chi^2$ goodness-of-fit test:

    [H,P,STATS] = chi2gof(X,'cdf',@(z)unifcdf(z,0,1),'edges',0:0.2:1)

The output is

    H = 0, P = 0.0953,
    STATS = chi2stat: 7.9
            df: 4
            edges: [0 0.2 0.4 0.6 0.8 1]
            O: [17 16 24 29 14]
            E: [20 20 20 20 20]

We accept the null hypothesis $H_0 : P = U[0, 1]$ at the default level of significance $\alpha = 0.05$ since the p-value 0.0953 is greater than $\alpha$. The meaning of the other parameters will become clear when we explain how this test works. The parameter 'cdf' takes a handle @ to a fully specified c.d.f. For example, to test if the data comes from N(3, 5) we would use @(z)normcdf(z,3,5), or to test the Poisson distribution $\Pi(4)$ we would use @(z)poisscdf(z,4).

It is important to note that when we use the chi-squared test to test, for example, the null hypothesis $H_0 : P = N(1, 2)$, the alternative hypothesis is $H_1 : P \neq N(1, 2)$. This is different from the setting of t-tests, where we would assume that the data comes from a normal distribution and test $H_0 : \mu = 1$ vs. $H_1 : \mu \neq 1$.
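The numbers reported by chi2gof in this example can be checked by hand. The following is a minimal sketch, assuming the Statistics Toolbox function chi2cdf is available; the variable names are chosen here for illustration. It recomputes the statistic from the observed counts O and expected counts E shown in the output, using the sum of (O - E)^2 / E over the bins and number of bins minus one degrees of freedom.

```matlab
% Observed and expected counts copied from the chi2gof output above.
O = [17 16 24 29 14];
E = [20 20 20 20 20];

% Chi-squared statistic: sum of (O - E)^2 / E over the bins.
chi2stat = sum((O - E).^2 ./ E);

% Degrees of freedom: number of bins minus one.
df = numel(O) - 1;

% p-value: upper tail probability of the chi-squared distribution.
p = 1 - chi2cdf(chi2stat, df);

fprintf('chi2stat = %.4f, df = %d, p = %.4f\n', chi2stat, df, p);
```

The printed statistic, degrees of freedom and p-value should agree with the fields chi2stat, df and P in the output above.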

Pearson's theorem. The chi-squared goodness-of-fit test is based on a probabilistic result that we will prove in this section.

[Figure 10.1: boxes $B_1, B_2, \ldots, B_r$ with probabilities $p_1, p_2, \ldots, p_r$.]

Let us consider $r$ boxes $B_1, \ldots, B_r$ and throw $n$ balls $X_1, \ldots, X_n$ into these boxes independently of each other with probabilities

$$P(X_i \in B_1) = p_1, \; \ldots, \; P(X_i \in B_r) = p_r,$$

so that $p_1 + \ldots + p_r = 1$. Let $\nu_j$ be the number of balls in the $j$th box:

$$\nu_j = \#\{\text{balls } X_1, \ldots, X_n \text{ in the box } B_j\} = \sum_{l=1}^{n} I(X_l \in B_j).$$

On average, the number of balls in the $j$th box will be $n p_j$ since

$$E\nu_j = \sum_{l=1}^{n} E\, I(X_l \in B_j) = \sum_{l=1}^{n} P(X_l \in B_j) = n p_j.$$

We can expect that the random variable $\nu_j$ should be close to $n p_j$. For example, we can use the Central Limit Theorem to describe precisely how close $\nu_j$ is to $n p_j$. The next result tells us how we can describe the closeness of $\nu_j$ to $n p_j$ simultaneously for all boxes $j \le r$. The main difficulty in this Theorem comes from the fact that the random variables $\nu_j$ for $j \le r$ are not independent because the total number of balls is fixed:

$$\nu_1 + \ldots + \nu_r = n.$$

If we know the counts in $r - 1$ boxes we automatically know the count in the last box.

Theorem. (Pearson) We have that the random variable

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j)^2}{n p_j} \to^{d} \chi^2_{r-1}$$

converges in distribution to the $\chi^2_{r-1}$-distribution with $(r - 1)$ degrees of freedom.
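Before the proof, the statement of the theorem can be explored numerically. The sketch below is an illustration only: the box probabilities, the sample sizes and the seed are hypothetical choices, and it assumes the Statistics Toolbox functions mnrnd and chi2inv. It throws n balls into r boxes many times, computes Pearson's statistic for each experiment, and compares its mean and upper tail with those of the chi-squared distribution with r - 1 degrees of freedom.

```matlab
rng(0);                       % fix the seed so the run is repeatable
p = [0.1 0.2 0.3 0.4];        % hypothetical box probabilities, sum to 1
r = numel(p);                 % number of boxes
n = 500;                      % balls per experiment
m = 10000;                    % number of repeated experiments

% Each row of nu holds the counts (nu_1, ..., nu_r) of one experiment.
nu = mnrnd(n, p, m);

% Pearson's statistic for every experiment (one value per row).
expected = repmat(n * p, m, 1);
T = sum((nu - expected).^2 ./ expected, 2);

% Compare with the chi-squared distribution with r - 1 degrees of freedom.
c = chi2inv(0.95, r - 1);     % its 95% quantile
fprintf('mean of T           = %.3f (chi-squared mean is %d)\n', mean(T), r - 1);
fprintf('fraction with T > c = %.3f (nominal value 0.05)\n', mean(T > c));
```

If the approximation is good, the sample mean of T should be close to r - 1 and roughly 5% of the simulated values should exceed c.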

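Proof. Let us fix a box $B_j$. The random variables

$$I(X_1 \in B_j), \ldots, I(X_n \in B_j)$$

that indicate whether each observation $X_i$ is in the box $B_j$ or not are i.i.d. with Bernoulli distribution $B(p_j)$ with probability of success

$$E\, I(X_1 \in B_j) = P(X_1 \in B_j) = p_j$$

and variance

$$\mathrm{Var}(I(X_1 \in B_j)) = p_j (1 - p_j).$$

Therefore, by the Central Limit Theorem the random variable

$$\frac{\nu_j - n p_j}{\sqrt{n p_j (1 - p_j)}} = \frac{\sum_{l=1}^{n} I(X_l \in B_j) - n p_j}{\sqrt{n p_j (1 - p_j)}} = \frac{\sum_{l=1}^{n} I(X_l \in B_j) - n\, E\, I(X_1 \in B_j)}{\sqrt{n\, \mathrm{Var}(I(X_1 \in B_j))}} \to^{d} N(0, 1)$$

converges in distribution to $N(0, 1)$. Therefore, the random variable

$$\frac{\nu_j - n p_j}{\sqrt{n p_j}} \to^{d} \sqrt{1 - p_j}\, N(0, 1) = N(0, 1 - p_j)$$

converges to the normal distribution with variance $1 - p_j$. Let us be a little informal and simply say that

$$\frac{\nu_j - n p_j}{\sqrt{n p_j}} \to Z_j$$

where the random variable $Z_j \sim N(0, 1 - p_j)$.

We know that each $Z_j$ has distribution $N(0, 1 - p_j)$ but, unfortunately, this does not tell us what the distribution of the sum $\sum Z_j^2$ will be, because, as we mentioned above, the r.v.s $\nu_j$ are not independent and their correlation structure will play an important role. To compute the covariance between $Z_i$ and $Z_j$ for $i \neq j$, let us first compute the covariance between

$$\frac{\nu_i - n p_i}{\sqrt{n p_i}} \quad \text{and} \quad \frac{\nu_j - n p_j}{\sqrt{n p_j}},$$

which is equal to

$$E\, \frac{\nu_i - n p_i}{\sqrt{n p_i}} \cdot \frac{\nu_j - n p_j}{\sqrt{n p_j}} = \frac{1}{n \sqrt{p_i p_j}} \left( E \nu_i \nu_j - E \nu_i\, n p_j - E \nu_j\, n p_i + n^2 p_i p_j \right)$$

$$= \frac{1}{n \sqrt{p_i p_j}} \left( E \nu_i \nu_j - n p_i\, n p_j - n p_j\, n p_i + n^2 p_i p_j \right) = \frac{1}{n \sqrt{p_i p_j}} \left( E \nu_i \nu_j - n^2 p_i p_j \right).$$

To compute $E \nu_i \nu_j$ we will use the fact that one ball cannot be inside two different boxes simultaneously, which means that

$$I(X_l \in B_i)\, I(X_l \in B_j) = 0. \qquad (10.0.1)$$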

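Therefore,

$$E \nu_i \nu_j = E \left( \sum_{l=1}^{n} I(X_l \in B_i) \right) \left( \sum_{l'=1}^{n} I(X_{l'} \in B_j) \right) = E \sum_{l, l'} I(X_l \in B_i)\, I(X_{l'} \in B_j)$$

$$= \underbrace{E \sum_{l = l'} I(X_l \in B_i)\, I(X_{l'} \in B_j)}_{\text{this equals } 0 \text{ by } (10.0.1)} + E \sum_{l \neq l'} I(X_l \in B_i)\, I(X_{l'} \in B_j)$$

$$= n(n - 1)\, E\, I(X_l \in B_i)\, E\, I(X_{l'} \in B_j) = n(n - 1) p_i p_j,$$

where in the last line we used that different balls are independent and that there are $n(n - 1)$ pairs $l \neq l'$. Therefore, the covariance above is equal to

$$\frac{1}{n \sqrt{p_i p_j}} \left( n(n - 1) p_i p_j - n^2 p_i p_j \right) = -\sqrt{p_i p_j}.$$

To summarize, we showed that the random variable

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j)^2}{n p_j} \to \sum_{j=1}^{r} Z_j^2,$$

where the normal random variables $Z_1, \ldots, Z_r$ satisfy

$$E Z_i^2 = 1 - p_i \quad \text{and covariance} \quad E Z_i Z_j = -\sqrt{p_i p_j}.$$

To prove the Theorem it remains to show that this covariance structure of the sequence $(Z_i)$ implies that their sum of squares has the $\chi^2_{r-1}$-distribution. To show this we will find a different representation for $\sum Z_i^2$.

Let $g_1, \ldots, g_r$ be i.i.d. standard normal random variables. Consider two vectors

$$g = (g_1, \ldots, g_r)^T \quad \text{and} \quad p = (\sqrt{p_1}, \ldots, \sqrt{p_r})^T$$

and consider the vector $g - (g \cdot p) p$, where $g \cdot p = g_1 \sqrt{p_1} + \ldots + g_r \sqrt{p_r}$ is the scalar product of $g$ and $p$. We will first prove that

$$g - (g \cdot p) p \ \text{ has the same joint distribution as } \ (Z_1, \ldots, Z_r). \qquad (10.0.2)$$

To show this let us consider two coordinates of the vector $g - (g \cdot p) p$:

$$i\text{th}: \ g_i - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_i} \quad \text{and} \quad j\text{th}: \ g_j - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_j}$$

and compute their covariance:

$$E \left( g_i - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_i} \right) \left( g_j - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_j} \right) = -\sqrt{p_i} \sqrt{p_j} - \sqrt{p_j} \sqrt{p_i} + \sum_{l=1}^{r} p_l \sqrt{p_i} \sqrt{p_j}$$

$$= -2 \sqrt{p_i p_j} + \sqrt{p_i p_j} = -\sqrt{p_i p_j}.$$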
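The covariance structure derived here can also be checked numerically. The sketch below is an illustration with hypothetical box probabilities; the unit vector with coordinates sqrt(p_i) is named q in the code to keep it apart from the probabilities p. Since g has identity covariance and g - (g.q)q = (I - q q')g, the covariance matrix of this vector is (I - q q')(I - q q')'.

```matlab
p = [0.2; 0.3; 0.5];      % hypothetical box probabilities (column vector)
r = numel(p);
q = sqrt(p);              % unit vector (sqrt(p_1), ..., sqrt(p_r))'

% g - (g.q)q = A*g with A = I - q*q'; g has identity covariance,
% so the covariance matrix of A*g is A*A'.
A = eye(r) - q * q';
C = A * A';

disp(C);                        % diagonal 1 - p_i, off-diagonal -sqrt(p_i*p_j)
disp(max(max(abs(C - A))));     % close to 0: A*A' equals A, so A is a projection
disp(trace(C));                 % equals r - 1
```

The diagonal and off-diagonal entries of C match the variances and covariances of $Z_1, \ldots, Z_r$, and the trace r - 1 is consistent with the number of degrees of freedom in the theorem.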

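Similarly, it is easy to compute that

$$E \left( g_i - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_i} \right)^2 = 1 - p_i.$$

This proves (10.0.2), which provides us with another way to formulate the convergence, namely, we have

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j)^2}{n p_j} \to^{d} |g - (g \cdot p) p|^2.$$

But this vector has a simple geometric interpretation. Since the vector $p$ is a unit vector,

$$|p|^2 = \sum_{l=1}^{r} (\sqrt{p_l})^2 = \sum_{l=1}^{r} p_l = 1,$$

the vector $V_1 = (p \cdot g) p$ is the projection of the vector $g$ on the line along $p$ and, therefore, the vector $V_2 = g - (p \cdot g) p$ will be the projection of $g$ onto the plane orthogonal to $p$, as shown in figure 10.2.

[Figure 10.2: New coordinate system.]

Let us consider a new orthonormal coordinate system with the first basis vector (first axis) equal to $p$. In this new coordinate system the vector $g$ will have coordinates $g' = (g_1', \ldots, g_r')^T$ obtained from $g$ by the orthogonal transformation $V = (p, p_2, \ldots, p_r)$ that maps the canonical basis into this new basis. But we proved in Lecture 4 that in that case $g_1', \ldots, g_r'$ will also be i.i.d. standard normal. From figure 10.2 it is obvious that the vector $V_2 = g - (p \cdot g) p$ in the new coordinate system has coordinates

$$(0, g_2', \ldots, g_r')^T$$

and, therefore,

$$|V_2|^2 = |g - (p \cdot g) p|^2 = (g_2')^2 + \ldots + (g_r')^2.$$

But this last sum, by definition, has the $\chi^2_{r-1}$ distribution since $g_2', \ldots, g_r'$ are i.i.d. standard normal. This finishes the proof of the Theorem.

Chi-squared goodness-of-fit test for simple hypothesis. Suppose that we observe an i.i.d. sample $X_1, \ldots, X_n$ of random variables that take a finite number of values $B_1, \ldots, B_r$ with unknown probabilities $p_1, \ldots, p_r$. Consider the hypotheses

$$H_0 : p_i = p_i^\circ \text{ for all } i = 1, \ldots, r, \qquad H_1 : \text{for some } i, \ p_i \neq p_i^\circ.$$

If the null hypothesis $H_0$ is true then by Pearson's theorem

$$T = \sum_{i=1}^{r} \frac{(\nu_i - n p_i^\circ)^2}{n p_i^\circ} \to^{d} \chi^2_{r-1}$$

where $\nu_i = \#\{X_j : X_j = B_i\}$ are the observed counts in each category. On the other hand, if $H_1$ holds then for some index $i$, $p_i \neq p_i^\circ$ and the statistic $T$ will behave differently. If $p_i$ is the true probability $P(X_1 = B_i)$ then by the CLT

$$\frac{\nu_i - n p_i}{\sqrt{n p_i}} \to^{d} N(0, 1 - p_i).$$

If we rewrite

$$\frac{\nu_i - n p_i^\circ}{\sqrt{n p_i^\circ}} = \frac{\nu_i - n p_i + n (p_i - p_i^\circ)}{\sqrt{n p_i^\circ}} = \sqrt{\frac{p_i}{p_i^\circ}}\, \frac{\nu_i - n p_i}{\sqrt{n p_i}} + \sqrt{n}\, \frac{p_i - p_i^\circ}{\sqrt{p_i^\circ}}$$

then the first term converges to $N(0, (1 - p_i) p_i / p_i^\circ)$ and the second term diverges to plus or minus $\infty$ because $p_i \neq p_i^\circ$. Therefore,

$$\frac{(\nu_i - n p_i^\circ)^2}{n p_i^\circ} \to +\infty,$$

which, obviously, implies that $T \to +\infty$. Therefore, as the sample size $n$ increases, the distribution of $T$ under the null hypothesis $H_0$ will approach the $\chi^2_{r-1}$-distribution and under the alternative hypothesis $H_1$ it will shift to $+\infty$, as shown in figure 10.3.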
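The different behavior of $T$ under the two hypotheses can be seen in a small simulation. The sketch below is an illustration only: the null probabilities, the alternative probabilities and the seed are hypothetical choices, and the Statistics Toolbox function mnrnd is assumed. For growing n it draws one vector of counts under the null hypothesis and one under the alternative, and evaluates $T$ against the null probabilities in both cases.

```matlab
rng(1);                     % fix the seed so the run is repeatable
p0    = [1/3 1/3 1/3];      % null probabilities p_i^0
ptrue = [0.4 0.3 0.3];      % hypothetical true probabilities, different from p0

for n = [100 1000 10000]
    nuH0 = mnrnd(n, p0);    % counts when H0 is true
    nuH1 = mnrnd(n, ptrue); % counts when H1 is true

    % In both cases T is computed against the null probabilities p0.
    T0 = sum((nuH0 - n*p0).^2 ./ (n*p0));
    T1 = sum((nuH1 - n*p0).^2 ./ (n*p0));

    fprintf('n = %5d: T under H0 = %7.2f, T under H1 = %7.2f\n', n, T0, T1);
end
```

Under the null hypothesis the values of $T$ should stay of the same order for all n, while under the alternative they should grow roughly linearly in n.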

[Figure 10.3: Behavior of $T$ under $H_0$ and $H_1$. Under $H_0$, $T \approx \chi^2_{r-1}$; under $H_1$, $T \to +\infty$; the threshold $c$ is marked on the horizontal axis.]

Therefore, we define the decision rule

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c. \end{cases}$$

We choose the threshold $c$ from the condition that the error of type 1 is equal to the level of significance $\alpha$:

$$\alpha = P_0(\delta \neq H_0) = P_0(T > c) \approx \chi^2_{r-1}(c, \infty)$$

since under the null hypothesis the distribution of $T$ is approximated by the $\chi^2_{r-1}$ distribution. Therefore, we take $c$ such that $\alpha = \chi^2_{r-1}(c, \infty)$. This test $\delta$ is called the chi-squared goodness-of-fit test.

Example. (Montana outlook poll.) In a 1992 poll 189 Montana residents were asked (among other things) whether their personal financial status was worse, the same or better than a year ago.

    Worse   Same   Better   Total
      58     64      67      189

We want to test the hypothesis $H_0$ that the underlying distribution is uniform, i.e. $p_1 = p_2 = p_3 = 1/3$. Let us take the level of significance $\alpha = 0.05$. Then the threshold $c$ in the chi-squared test

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

is found from the condition that $\chi^2_{2}(c, \infty) = 0.05$, with $r - 1 = 3 - 1 = 2$ degrees of freedom, which gives $c \approx 5.99$. We compute the chi-squared statistic

$$T = \frac{(58 - 189/3)^2}{189/3} + \frac{(64 - 189/3)^2}{189/3} + \frac{(67 - 189/3)^2}{189/3} \approx 0.667 < 5.99,$$

which means that we accept $H_0$ at the level of significance 0.05.

Goodness-of-fit for continuous distribution. Let $X_1, \ldots, X_n$ be an i.i.d. sample from an unknown distribution $P$ and consider the following hypotheses:

$$H_0 : P = P_0, \qquad H_1 : P \neq P_0$$

for some particular, possibly continuous distribution $P_0$. To apply the chi-squared test above we will group the values of the $X$s into a finite number of subsets. To do this, we will split the set of all possible outcomes into a finite number of intervals $I_1, \ldots, I_r$ as shown in figure 10.4.

[Figure 10.4: Discretizing continuous distribution. The p.d.f. of $P_0$ over $x$, with intervals $I_1, I_2, \ldots, I_r$ and probabilities $p_1^\circ, p_2^\circ, \ldots, p_r^\circ$.]
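Returning to the Montana poll example, its arithmetic can be reproduced in a few lines. This is a minimal sketch, assuming the Statistics Toolbox functions chi2inv and chi2cdf; the variable names are chosen here for illustration. It computes the statistic, the threshold at level 0.05 and, in addition to what the example reports, the p-value.

```matlab
nu = [58 64 67];            % observed counts: worse, same, better
n  = sum(nu);               % 189 respondents
p0 = [1 1 1] / 3;           % uniform null hypothesis

T  = sum((nu - n*p0).^2 ./ (n*p0));   % chi-squared statistic
df = numel(nu) - 1;                   % r - 1 = 2 degrees of freedom
c  = chi2inv(1 - 0.05, df);           % threshold at level 0.05
pvalue = 1 - chi2cdf(T, df);          % upper tail probability of T

fprintf('T = %.4f, c = %.4f, p-value = %.4f\n', T, c, pvalue);
```

Here T equals 42/63, well below the threshold, so the uniform hypothesis is accepted, in agreement with the example.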

The null hypothesis $H_0$, of course, implies that for all intervals

$$P(X \in I_j) = P_0(X \in I_j) = p_j^\circ.$$

Therefore, we can do the chi-squared test for

$$H_0' : P(X \in I_j) = p_j^\circ \text{ for all } j \le r, \qquad H_1' : \text{otherwise}.$$

Asking whether $H_0'$ holds is, of course, a weaker question than asking if $H_0$ holds, because $H_0$ implies $H_0'$ but not the other way around. There are many distributions different from $P_0$ that have the same probabilities of the intervals $I_1, \ldots, I_r$ as $P_0$. On the other hand, if we group into more and more intervals, our discrete approximation of $P$ will get closer and closer to $P$, so in some sense $H_0'$ will get closer to $H_0$. However, we cannot split into too many intervals either, because the $\chi^2_{r-1}$-distribution approximation for the statistic $T$ in Pearson's theorem is asymptotic. The rule of thumb is to group the data in such a way that the expected count in each interval

$$n p_i^\circ = n P_0(X \in I_i) \ge 5$$

is at least 5. (Matlab, for example, will give a warning if this expected number is less than five in any interval.) One approach could be to split into intervals of equal probabilities $p_i^\circ = 1/r$ and choose their number $r$ so that

$$n p_i^\circ = \frac{n}{r} \ge 5.$$

Example. Let us go back to the example from Lecture 2. Let us generate 100 observations from the Beta distribution B(5, 2):

    X = betarnd(5,2,100,1);

Let us fit the normal distribution $N(\mu, \sigma^2)$ to this data. The MLE $\hat{\mu}$ and $\hat{\sigma}$ are

    mean(X) = 0.7421, std(X,1) = 0.1392.

Note that std(X) in Matlab will produce the square root of the unbiased estimator $(n/(n-1)) \hat{\sigma}^2$. Let us test the hypothesis that the sample has this fitted normal distribution.

    [H,P,STATS] = chi2gof(X,'cdf',@(z)normcdf(z,0.7421,0.1392))

outputs

    H = 1, P = 0.0041,
    STATS = chi2stat: 20.7589
            df: 7
            edges: [1x9 double]
            O: [14 4 11 14 14 16 21 6]
            E: [1x8 double]

Our hypothesis was rejected with a p-value of 0.0041. Matlab grouped the data into 8 intervals. Notice df: 7, the degrees of freedom $r - 1 = 8 - 1 = 7$.
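The equal-probability grouping suggested by the rule of thumb can be sketched as follows. This is an illustration under stated assumptions, not a description of what chi2gof does internally: the seed and variable names are hypothetical, the Statistics Toolbox functions betarnd, normcdf and chi2cdf are assumed, and, as in the example, the fitted normal distribution is treated as a fully specified null with r - 1 degrees of freedom. The sketch uses the fact that an observation falls into the $j$th of $r$ equal-probability intervals of $P_0$ exactly when the null c.d.f. evaluated at that observation lies between $(j-1)/r$ and $j/r$.

```matlab
rng(2);                              % fix the seed so the run is repeatable
n = 100;
X = betarnd(5, 2, n, 1);             % sample from Beta(5,2), as in the example
mu = mean(X);                        % MLE of the mean
sigma = std(X, 1);                   % MLE of the standard deviation

r = floor(n / 5);                    % largest r with expected count n/r >= 5

% Interval index of each observation: apply the null c.d.f. and cut [0,1]
% into r pieces of equal length 1/r.
u = normcdf(X, mu, sigma);
idx = min(floor(u * r) + 1, r);      % values 1..r (u == 1 goes to the last interval)

nu = accumarray(idx, 1, [r 1])';     % observed counts per interval
E  = (n / r) * ones(1, r);           % expected counts, all equal to n/r

T = sum((nu - E).^2 ./ E);           % chi-squared statistic
pvalue = 1 - chi2cdf(T, r - 1);      % p-value with r - 1 degrees of freedom

fprintf('r = %d, T = %.4f, p-value = %.4f\n', r, T, pvalue);
```

Working with the transformed values u avoids computing the interval end points explicitly; taking r as large as the rule of thumb allows is one possible choice, and a smaller r would trade resolution for a better chi-squared approximation.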