Chi-squared goodness-of-fit test.

Section 10. Chi-squared goodness-of-fit test.

Example. Let us start with a Matlab example. Let us generate a vector X of 100 i.i.d. uniform random variables on [0, 1]:

    X = rand(100,1)

The parameters (100, 1) here mean that we generate a 100 × 1 matrix of uniform random variables. Let us test whether the vector X comes from the distribution U[0, 1] using the $\chi^2$ goodness-of-fit test:

    [H,P,STATS] = chi2gof(X,'cdf',@(z)unifcdf(z,0,1),'edges',0:0.2:1)

The output is

    H = 0, P = 0.0953,
    STATS = chi2stat: 7.9
            df: 4
            edges: [0 0.2 0.4 0.6 0.8 1]
            O: [17 16 24 29 14]
            E: [20 20 20 20 20]

We accept the null hypothesis $H_0: P = U[0, 1]$ at the default level of significance $\alpha = 0.05$ since the p-value 0.0953 is greater than $\alpha$. The meaning of the other parameters will become clear when we explain how this test works. The parameter 'cdf' takes the handle @ to a fully specified c.d.f. For example, to test if the data comes from N(3, 5) we would use @(z)normcdf(z,3,5), or to test the Poisson distribution $\Pi(4)$ we would use @(z)poisscdf(z,4).

It is important to note that when we use the chi-squared test to test, for example, the null hypothesis $H_0: P = N(1, 2)$, the alternative hypothesis is $H_1: P \neq N(1, 2)$. This is different from the setting of t-tests, where we would assume that the data comes from a normal distribution and test $H_0: \mu = 1$ vs. $H_1: \mu \neq 1$.
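The statistic and p-value reported by chi2gof in the example above can be recomputed from the counts O and E alone. The following is a minimal Python sketch, not part of the original Matlab session. The helper name chi2_sf_even_df is hypothetical, and it uses the closed form of the chi-squared upper tail that holds only when the number of degrees of freedom is even (here df = 4).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail P(chi2_df > x); this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    # For df = 2m: exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# Observed and expected counts copied from the STATS output above.
observed = [17, 16, 24, 29, 14]
expected = [20, 20, 20, 20, 20]

chi2stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf_even_df(chi2stat, df)

print(chi2stat, df, p_value)
```

Under these assumptions the recomputed values agree with chi2stat = 7.9 and P = 0.0953 to the precision shown in the output.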

Pearson's theorem.

The chi-squared goodness-of-fit test is based on a probabilistic result that we will prove in this section.

[Figure 10.1: boxes $B_1, B_2, \dots, B_r$ with counts $\nu_1, \nu_2, \dots, \nu_r$ and probabilities $p_1, p_2, \dots, p_r$.]

Let us consider r boxes $B_1, \dots, B_r$ and throw n balls $X_1, \dots, X_n$ into these boxes independently of each other with probabilities

$$P(X_i \in B_1) = p_1, \ \dots, \ P(X_i \in B_r) = p_r,$$

so that $p_1 + \dots + p_r = 1$. Let $\nu_j$ be the number of balls in the jth box:

$$\nu_j = \#\{\text{balls } X_1, \dots, X_n \text{ in the box } B_j\} = \sum_{l=1}^{n} I(X_l \in B_j).$$

On average, the number of balls in the jth box will be $n p_j$, since

$$E\nu_j = \sum_{l=1}^{n} E I(X_l \in B_j) = \sum_{l=1}^{n} P(X_l \in B_j) = n p_j.$$

We can expect that the random variable $\nu_j$ should be close to $n p_j$. For example, we can use the Central Limit Theorem to describe precisely how close $\nu_j$ is to $n p_j$. The next result tells us how we can describe the closeness of $\nu_j$ to $n p_j$ simultaneously for all boxes $j \le r$. The main difficulty in this Theorem comes from the fact that the random variables $\nu_j$ for $j \le r$ are not independent, because the total number of balls is fixed:

$$\nu_1 + \dots + \nu_r = n.$$

If we know the counts in r − 1 boxes we automatically know the count in the last box.

Theorem. (Pearson) We have that the random variable

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j)^2}{n p_j} \to^d \chi^2_{r-1}$$

converges in distribution to the $\chi^2_{r-1}$-distribution with (r − 1) degrees of freedom.
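The statement of the theorem can be explored with a small simulation: throw n balls into r boxes many times and look at the average of Pearson's statistic. This is only a Python sketch under assumed values (the probabilities, n, the number of repetitions and the seed are all arbitrary choices, and the function names are hypothetical). It uses the fact that the mean of the statistic is exactly r − 1 for every n, because $E(\nu_j - n p_j)^2 = n p_j (1 - p_j)$.

```python
import random


def throw_balls(n, probs, rng):
    """Throw n balls independently into len(probs) boxes; return the counts nu_j."""
    counts = [0] * len(probs)
    for box in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[box] += 1
    return counts


def pearson_statistic(counts, probs):
    """Sum over boxes of (nu_j - n p_j)^2 / (n p_j)."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


rng = random.Random(0)           # fixed seed so the run is reproducible
probs = [0.1, 0.2, 0.3, 0.4]     # assumed box probabilities, r = 4
n, repetitions = 200, 2000

stats = [pearson_statistic(throw_balls(n, probs, rng), probs)
         for _ in range(repetitions)]
mean_stat = sum(stats) / repetitions

print("average statistic:", mean_stat, "r - 1 =", len(probs) - 1)
```

The simulated average should land near r − 1 = 3, which is also the mean of the limiting $\chi^2_{3}$ distribution. A histogram of stats would be the natural next step for comparing the whole distribution.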

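Proof. Let us fix a box $B_j$. The random variables

$$I(X_1 \in B_j), \dots, I(X_n \in B_j)$$

that indicate whether each observation $X_i$ is in the box $B_j$ or not are i.i.d. with Bernoulli distribution $B(p_j)$ with probability of success

$$E I(X_1 \in B_j) = P(X_1 \in B_j) = p_j$$

and variance

$$\mathrm{Var}(I(X_1 \in B_j)) = p_j (1 - p_j).$$

Therefore, by the Central Limit Theorem the random variable

$$\frac{\nu_j - n p_j}{\sqrt{n p_j (1 - p_j)}} = \frac{\sum_{l=1}^{n} I(X_l \in B_j) - n p_j}{\sqrt{n p_j (1 - p_j)}} = \frac{\sum_{l=1}^{n} I(X_l \in B_j) - n E}{\sqrt{n \mathrm{Var}}} \to^d N(0, 1)$$

converges in distribution to N(0, 1). Therefore, the random variable

$$\frac{\nu_j - n p_j}{\sqrt{n p_j}} \to^d \sqrt{1 - p_j}\, N(0, 1) = N(0, 1 - p_j)$$

converges to the normal distribution with variance $1 - p_j$. Let us be a little informal and simply say that

$$\frac{\nu_j - n p_j}{\sqrt{n p_j}} \to Z_j,$$

where the random variable $Z_j \sim N(0, 1 - p_j)$.

We know that each $Z_j$ has distribution $N(0, 1 - p_j)$ but, unfortunately, this does not tell us what the distribution of the sum $\sum Z_j^2$ will be, because as we mentioned above the r.v.s $\nu_j$ are not independent and their correlation structure will play an important role. To compute the covariance between $Z_i$ and $Z_j$ for $i \neq j$, let us first compute the covariance between

$$\frac{\nu_i - n p_i}{\sqrt{n p_i}} \quad \text{and} \quad \frac{\nu_j - n p_j}{\sqrt{n p_j}},$$

which is equal to

$$E\, \frac{\nu_i - n p_i}{\sqrt{n p_i}} \cdot \frac{\nu_j - n p_j}{\sqrt{n p_j}} = \frac{1}{n \sqrt{p_i p_j}} \left( E\nu_i \nu_j - E\nu_i\, n p_j - E\nu_j\, n p_i + n^2 p_i p_j \right)$$

$$= \frac{1}{n \sqrt{p_i p_j}} \left( E\nu_i \nu_j - n p_i\, n p_j - n p_j\, n p_i + n^2 p_i p_j \right) = \frac{1}{n \sqrt{p_i p_j}} \left( E\nu_i \nu_j - n^2 p_i p_j \right).$$

To compute $E\nu_i \nu_j$ we will use the fact that one ball cannot be inside two different boxes simultaneously, which means that

$$I(X_l \in B_i)\, I(X_l \in B_j) = 0. \qquad (10.0.1)$$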

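Therefore,

$$E\nu_i \nu_j = E \Big( \sum_{l=1}^{n} I(X_l \in B_i) \Big) \Big( \sum_{l'=1}^{n} I(X_{l'} \in B_j) \Big) = E \sum_{l, l'} I(X_l \in B_i)\, I(X_{l'} \in B_j)$$

$$= E \sum_{l = l'} I(X_l \in B_i)\, I(X_{l'} \in B_j) + E \sum_{l \neq l'} I(X_l \in B_i)\, I(X_{l'} \in B_j),$$

where the first sum equals 0 by (10.0.1), so that

$$E\nu_i \nu_j = n(n - 1)\, E I(X_l \in B_i)\, E I(X_{l'} \in B_j) = n(n - 1)\, p_i p_j.$$

Therefore, the covariance above is equal to

$$\frac{1}{n \sqrt{p_i p_j}} \left( n(n - 1)\, p_i p_j - n^2 p_i p_j \right) = -\sqrt{p_i p_j}.$$

To summarize, we showed that the random variable

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j)^2}{n p_j} \to \sum_{j=1}^{r} Z_j^2,$$

where the normal random variables $Z_1, \dots, Z_r$ satisfy

$$E Z_i^2 = 1 - p_i \quad \text{and covariance} \quad E Z_i Z_j = -\sqrt{p_i p_j}.$$

To prove the Theorem it remains to show that this covariance structure of the sequence $(Z_i)$ implies that their sum of squares has the $\chi^2_{r-1}$-distribution. To show this we will find a different representation for $\sum Z_i^2$.

Let $g_1, \dots, g_r$ be i.i.d. standard normal random variables. Consider two vectors

$$g = (g_1, \dots, g_r)^T \quad \text{and} \quad p = (\sqrt{p_1}, \dots, \sqrt{p_r})^T$$

and consider the vector $g - (g \cdot p)\, p$, where $g \cdot p = g_1 \sqrt{p_1} + \dots + g_r \sqrt{p_r}$ is the scalar product of g and p. We will first prove that

$$g - (g \cdot p)\, p \ \text{ has the same joint distribution as } \ (Z_1, \dots, Z_r). \qquad (10.0.2)$$

To show this let us consider two coordinates of the vector $g - (g \cdot p)\, p$:

$$i\text{th}: \ g_i - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_i} \quad \text{and} \quad j\text{th}: \ g_j - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_j}$$

and compute their covariance:

$$E \Big( g_i - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_i} \Big) \Big( g_j - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_j} \Big) = -\sqrt{p_i} \sqrt{p_j} - \sqrt{p_j} \sqrt{p_i} + \sum_{l=1}^{r} p_l \sqrt{p_i} \sqrt{p_j}$$

$$= -2 \sqrt{p_i p_j} + \sqrt{p_i p_j} = -\sqrt{p_i p_j}.$$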
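The covariance structure claimed in (10.0.2) can also be checked numerically. Since g has identity covariance, the vector $g - (g \cdot p)\, p = M g$ with $M = I - p p^T$ has covariance $M M^T$. The Python sketch below builds that matrix for an assumed probability vector and compares it entry by entry with $1 - p_i$ on the diagonal and $-\sqrt{p_i p_j}$ off the diagonal. The function name and the chosen probabilities are illustrative only.

```python
import math


def projection_covariance(probs):
    """Covariance matrix M M^T of g - (g.p)p, where M = I - p p^T and
    p = (sqrt(p_1), ..., sqrt(p_r)); g is standard normal with identity covariance."""
    root = [math.sqrt(q) for q in probs]
    r = len(probs)
    m = [[(1.0 if i == j else 0.0) - root[i] * root[j] for j in range(r)]
         for i in range(r)]
    return [[sum(m[i][k] * m[j][k] for k in range(r)) for j in range(r)]
            for i in range(r)]


probs = [0.2, 0.3, 0.5]          # assumed box probabilities, summing to 1
cov = projection_covariance(probs)

for i, row in enumerate(cov):
    for j, value in enumerate(row):
        target = 1 - probs[i] if i == j else -math.sqrt(probs[i] * probs[j])
        print(i, j, round(value, 6), round(target, 6))
```

The agreement relies on p being a unit vector, which makes M a projection, so that $M M^T = M$.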

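Similarly, it is easy to compute that

$$E \Big( g_i - \sum_{l=1}^{r} g_l \sqrt{p_l} \sqrt{p_i} \Big)^2 = 1 - 2 p_i + p_i = 1 - p_i.$$

This proves (10.0.2), which provides us with another way to formulate the convergence, namely, we have

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j)^2}{n p_j} \to^d |g - (g \cdot p)\, p|^2.$$

But this vector has a simple geometric interpretation. Since the vector p is a unit vector,

$$|p|^2 = \sum_{l=1}^{r} (\sqrt{p_l})^2 = \sum_{l=1}^{r} p_l = 1,$$

the vector $V_1 = (p \cdot g)\, p$ is the projection of the vector g on the line along p and, therefore, the vector $V_2 = g - (p \cdot g)\, p$ will be the projection of g onto the plane orthogonal to p, as shown in figure 10.2.

[Figure 10.2: New coordinate system.]

Let us consider a new orthonormal coordinate system with the first basis vector (first axis) equal to p. In this new coordinate system the vector g will have coordinates $g' = (g'_1, \dots, g'_r)^T$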

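obtained from g by the orthogonal transformation $V = (p, p_2, \dots, p_r)$ that maps the canonical basis into this new basis. But we proved in Lecture 4 that in that case $g'_1, \dots, g'_r$ will also be i.i.d. standard normal. From figure 10.2 it is obvious that the vector $V_2 = g - (p \cdot g)\, p$ in the new coordinate system has coordinates

$$(0, g'_2, \dots, g'_r)^T$$

and, therefore,

$$|V_2|^2 = |g - (p \cdot g)\, p|^2 = (g'_2)^2 + \dots + (g'_r)^2.$$

But this last sum, by definition, has the $\chi^2_{r-1}$ distribution since $g'_2, \dots, g'_r$ are i.i.d. standard normal. This finishes the proof of the Theorem.

Chi-squared goodness-of-fit test for simple hypothesis.

Suppose that we observe an i.i.d. sample $X_1, \dots, X_n$ of random variables that take a finite number of values $B_1, \dots, B_r$ with unknown probabilities $p_1, \dots, p_r$. Consider the hypotheses

$$H_0: \ p_i = p_i^{\circ} \ \text{for all } i = 1, \dots, r,$$
$$H_1: \ \text{for some } i, \ p_i \neq p_i^{\circ}.$$

If the null hypothesis $H_0$ is true then by Pearson's theorem

$$T = \sum_{i=1}^{r} \frac{(\nu_i - n p_i^{\circ})^2}{n p_i^{\circ}} \to^d \chi^2_{r-1},$$

where $\nu_i = \#\{X_j : X_j = B_i\}$ are the observed counts in each category. On the other hand, if $H_1$ holds then for some index i, $p_i \neq p_i^{\circ}$ and the statistic T will behave differently. If $p_i$ is the true probability $P(X_1 = B_i)$ then by the CLT

$$\frac{\nu_i - n p_i}{\sqrt{n p_i}} \to^d N(0, 1 - p_i).$$

If we rewrite

$$\frac{\nu_i - n p_i^{\circ}}{\sqrt{n p_i^{\circ}}} = \frac{\nu_i - n p_i + n (p_i - p_i^{\circ})}{\sqrt{n p_i^{\circ}}} = \sqrt{\frac{p_i}{p_i^{\circ}}}\, \frac{\nu_i - n p_i}{\sqrt{n p_i}} + \sqrt{n}\, \frac{p_i - p_i^{\circ}}{\sqrt{p_i^{\circ}}}$$

then the first term converges to $N(0, (1 - p_i)\, p_i / p_i^{\circ})$ and the second term diverges to plus or minus $\infty$ because $p_i \neq p_i^{\circ}$. Therefore,

$$\frac{(\nu_i - n p_i^{\circ})^2}{n p_i^{\circ}} \to +\infty,$$

which, obviously, implies that $T \to +\infty$. Therefore, as the sample size n increases, the distribution of T under the null hypothesis $H_0$ will approach the $\chi^2_{r-1}$-distribution and under the alternative hypothesis $H_1$ it will shift to $+\infty$, as shown in figure 10.3.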
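The shift of T to $+\infty$ under the alternative can be made concrete through its mean. Since $\nu_i$ is binomial, $E(\nu_i - n p_i^{\circ})^2 = n p_i (1 - p_i) + n^2 (p_i - p_i^{\circ})^2$, so the mean of T is $\sum_i p_i (1 - p_i) / p_i^{\circ} + n \sum_i (p_i - p_i^{\circ})^2 / p_i^{\circ}$. The Python sketch below evaluates this expression. The function name and the two probability vectors are assumptions chosen purely for illustration.

```python
def expected_statistic(n, true_probs, null_probs):
    """Exact mean of T: for each category, (variance + squared bias) / (n * p0)."""
    return sum(
        (n * p * (1 - p) + (n * (p - p0)) ** 2) / (n * p0)
        for p, p0 in zip(true_probs, null_probs)
    )


null_probs = [1 / 3, 1 / 3, 1 / 3]   # hypothesised probabilities (assumed)
alt_probs = [0.4, 0.3, 0.3]          # a hypothetical true distribution under H1

for n in (50, 500, 5000):
    under_h0 = expected_statistic(n, null_probs, null_probs)
    under_h1 = expected_statistic(n, alt_probs, null_probs)
    print(n, under_h0, under_h1)
```

Under $H_0$ the mean stays at r − 1 for every n, while under this alternative it grows linearly in n, which matches the qualitative picture described above.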

[Figure 10.3: Behavior of T under $H_0$ and $H_1$. Under $H_0$ the distribution of T is approximately $\chi^2_{r-1}$; under $H_1$, $T \to +\infty$. The threshold c is marked on the horizontal axis.]

Therefore, we define the decision rule

$$\delta = \begin{cases} H_0: & T \le c \\ H_1: & T > c. \end{cases}$$

We choose the threshold c from the condition that the error of type 1 is equal to the level of significance $\alpha$:

$$\alpha = P_0(\delta \neq H_0) = P_0(T > c) \approx \chi^2_{r-1}(c, \infty),$$

since under the null hypothesis the distribution of T is approximated by the $\chi^2_{r-1}$ distribution. Therefore, we take c such that $\alpha = \chi^2_{r-1}(c, \infty)$. This test $\delta$ is called the chi-squared goodness-of-fit test.

Example. (Montana outlook poll.) In a 1992 poll 189 Montana residents were asked (among other things) whether their personal financial status was worse, the same or better than a year ago.

    Worse   Same   Better   Total
    58      64     67       189

We want to test the hypothesis $H_0$ that the underlying distribution is uniform, i.e. $p_1 = p_2 = p_3 = 1/3$. Let us take the level of significance $\alpha = 0.05$. Then the threshold c in the chi-squared

test $\delta$ ($H_0$ if $T \le c$, $H_1$ if $T > c$) is found from the condition that $\chi^2_{2}(c, \infty) = 0.05$ (here r − 1 = 3 − 1 = 2), which gives $c \approx 5.99$. We compute the chi-squared statistic

$$T = \frac{(58 - 189/3)^2}{189/3} + \frac{(64 - 189/3)^2}{189/3} + \frac{(67 - 189/3)^2}{189/3} = 0.666 < 5.99,$$

which means that we accept $H_0$ at the level of significance 0.05.

Goodness-of-fit for continuous distribution.

Let $X_1, \dots, X_n$ be an i.i.d. sample from an unknown distribution P and consider the following hypotheses:

$$H_0: \ P = P_0$$
$$H_1: \ P \neq P_0$$

for some particular, possibly continuous distribution $P_0$. To apply the chi-squared test above we will group the values of the Xs into a finite number of subsets. To do this, we will split the set of all possible outcomes X into a finite number of intervals $I_1, \dots, I_r$ as shown in figure 10.4.

[Figure 10.4: Discretizing continuous distribution. The p.d.f. of $P_0$ over x, with intervals $I_1, I_2, \dots, I_r$ and their probabilities $p_1^{\circ}, p_2^{\circ}, \dots, p_r^{\circ}$.]
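The arithmetic of the Montana poll example above can be reproduced in a few lines. This is a Python sketch rather than the notes' Matlab. It uses the fact that for 2 degrees of freedom the chi-squared upper tail is $\chi^2_2(c, \infty) = e^{-c/2}$, so the threshold solves to $c = -2 \ln \alpha$. The variable names are arbitrary.

```python
import math

# Poll counts from the table in the example.
counts = {"Worse": 58, "Same": 64, "Better": 67}

n = sum(counts.values())                 # 189 respondents
expected = n / len(counts)               # n * (1/3) under the uniform null
T = sum((c - expected) ** 2 / expected for c in counts.values())

alpha = 0.05
threshold = -2.0 * math.log(alpha)       # valid for df = 2 only
p_value = math.exp(-T / 2.0)             # chi-squared upper tail, df = 2
accept_h0 = T <= threshold

print(T, threshold, p_value, accept_h0)
```

The statistic is far below the threshold, so the uniform hypothesis is accepted at level 0.05, in line with the conclusion of the example.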

The null hypothesis $H_0$, of course, implies that for all intervals

$$P(X \in I_j) = P_0(X \in I_j) = p_j^{\circ}.$$

Therefore, we can do the chi-squared test for

$$H_0': \ P(X \in I_j) = p_j^{\circ} \ \text{for all } j \le r,$$
$$H_1': \ \text{otherwise}.$$

Asking whether $H_0'$ holds is, of course, a weaker question than asking if $H_0$ holds, because $H_0$ implies $H_0'$ but not the other way around. There are many distributions different from $P_0$ that have the same probabilities of the intervals $I_1, \dots, I_r$ as $P_0$. On the other hand, if we group into more and more intervals, our discrete approximation of P will get closer and closer to P, so in some sense $H_0'$ will get closer to $H_0$. However, we cannot split into too many intervals either, because the $\chi^2_{r-1}$-distribution approximation for the statistic T in Pearson's theorem is asymptotic. The rule of thumb is to group the data in such a way that the expected count in each interval,

$$n p_i^{\circ} = n P_0(X \in I_i) \ge 5,$$

is at least 5. (Matlab, for example, will give a warning if this expected number is less than five in any interval.) One approach could be to split into intervals of equal probabilities $p_i^{\circ} = 1/r$ and choose their number r so that

$$n p_i^{\circ} = \frac{n}{r} \ge 5.$$

Example. Let us go back to the example from Lecture 2. Let us generate 100 observations from the Beta distribution B(5, 2):

    X = betarnd(5,2,100,1);

Let us fit the normal distribution $N(\mu, \sigma^2)$ to this data. The MLE $\hat{\mu}$ and $\hat{\sigma}$ are

    mean(X) = 0.7421, std(X,1) = 0.1392.

Note that std(X) in Matlab will produce the square root of the unbiased estimator $(n/(n-1))\, \hat{\sigma}^2$. Let us test the hypothesis that the sample has this fitted normal distribution.

    [H,P,STATS] = chi2gof(X,'cdf',@(z)normcdf(z,0.7421,0.1392))

outputs

    H = 1, P = 0.0041,
    STATS = chi2stat: 20.7589
            df: 7
            edges: [1x9 double]
            O: [14 4 11 14 14 16 21 6]
            E: [1x8 double]

Our hypothesis was rejected with a p-value of 0.0041. Matlab split the real line into 8 intervals of equal probabilities. Notice df: 7, the degrees of freedom r − 1 = 8 − 1 = 7.
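The equal-probability grouping described above can be sketched as follows. This is an illustrative Python version, not a reproduction of what chi2gof does internally. The function names, the seed and the choice r = 8 are assumptions, and because the sample is freshly simulated the numbers will not match the Matlab output quoted in the example. The interval edges are the quantiles $k/r$ of the fitted normal distribution, so every interval has expected count n/r.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def equal_probability_edges(dist, r):
    """Interior edges splitting the real line into r intervals of probability 1/r."""
    return [dist.inv_cdf(k / r) for k in range(1, r)]


def binned_statistic(sample, dist, r):
    """Observed counts, common expected count n/r, and Pearson's statistic T."""
    n = len(sample)
    edges = equal_probability_edges(dist, r)
    observed = [0] * r
    for x in sample:
        observed[bisect_right(edges, x)] += 1   # index of the interval containing x
    expected = n / r
    t = sum((o - expected) ** 2 / expected for o in observed)
    return observed, expected, t


rng = random.Random(0)                                   # arbitrary seed
sample = [rng.betavariate(5, 2) for _ in range(100)]     # Beta(5, 2) data

# MLE fit of the normal distribution: pstdev divides by n, like std(X,1).
fitted = NormalDist(fmean(sample), pstdev(sample))

r = 8                                                    # n / r = 12.5 >= 5
observed, expected, t = binned_statistic(sample, fitted, r)
print(observed, expected, t)
```

With n = 100 the rule of thumb n/r ≥ 5 allows at most r = 20 intervals; r = 8 is used here only to mirror the number of intervals in the example.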