Goodness of Fit. Proportional Model. Probability Models & Frequency Data




Outline: Probability Models & Frequency Data; Goodness of Fit; Proportional Model; Chi-square Statistic; Example (R); Distribution; Assumptions; Example (R)

Goodness of Fit

Goodness-of-fit tests are used to compare any observed frequency distribution against an expected frequency distribution. We previously did specialized examples of this for a probability distribution (the 50:50 expected right- vs. left-handed toad example) and for the binomial distribution (sperm genes on the X chromosome of mice). The binomial test we did is a specialized form for categorical variables with only two outcomes. Here we will introduce a more generalized form.

Proportional Model

The proportional model is one of the simplest probability models: the frequency of occurrence of events is proportional to the number of opportunities (e.g., the X chromosome example). What would we do, however, if we had multiple proportions? A more generalized form of this test is the chi-square (χ²) goodness-of-fit test.
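As a minimal sketch of the proportional model in R (hypothetical numbers, not from the slides), the expected frequency in each category is simply the total number of events multiplied by each category's share of the opportunities:

opportunities <- c(10, 20, 30)                    # hypothetical opportunity counts
n_events <- 120                                   # hypothetical total number of events
n_events * opportunities / sum(opportunities)     # expected frequencies: 20 40 60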

Example 8.1: Under the proportional model, one would expect babies born in the U.S. to be born in equal proportions across the days of the week (i.e., 14.28% per day). Is this true? Shown below are a random sample of 350 births from across the U.S. during the year 1999 (these are the same counts used in the R code later on):

Day:     Sun  Mon  Tue  Wed  Thu  Fri  Sat
Births:   33   41   63   63   47   56   47

Goodness-of-Fit Test

The χ² goodness-of-fit test uses the chi-square statistic (based upon the chi-square distribution) to compare frequency data to a model stated by the null hypothesis. Continuing with our example:

H0: The probability of birth is the same every day of the week.
HA: The probability of birth is not the same every day of the week.

Again, H0 and HA are statements about the population from which the sample is obtained.

In order to proceed, we need to determine the expected frequencies under the null model. Examining the calendar for 1999, we see that each day of the week does not occur an equal number of times (52) in the year (there was an additional Friday), so we need to adjust for this.

Goodness-of-Fit Test

The calculation of the expected frequencies is straightforward. For a day that occurs 52 times in the year:

Expected = 350 × (52/365) = 49.863

NB: the expected frequencies must sum to the total number observed (350). Once you have a full set of observed and expected frequencies, you can determine a chi-square statistic and its associated probability.

Chi-square Statistic

The chi-square statistic measures the discrepancy between observed and expected frequencies (make sure to always use the absolute frequencies [counts], not relative frequencies [proportions]). The contribution of each category to the chi-square statistic is:

(Observed − Expected)² / Expected

For example, for the day with 33 observed births: (33 − 49.863)² / 49.863 = 5.70

The χ² statistic is additive across all categories, so:

χ² = 5.70 + 1.58 + 3.46 + 3.46 + 0.16 + 0.53 + 0.16 = 15.05

We now have a calculated test statistic and, as usual, need to compare it to a table value at a particular number of degrees of freedom to make our decision. In other words, is 15.05 large enough to be significant?

df = (number of categories) − 1 = 7 − 1 = 6

From Statistical Table A in your text, we see that at df = 6 the critical value for χ² is 12.59. Since 15.05 > 12.59, we reject the null hypothesis and conclude that births occur in unequal proportions across the days of the week.
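The table lookup can also be done directly in R; a small sketch (critical value and a P-value for the calculated statistic):

qchisq(0.95, df = 6)                        # critical value at P = 0.05, ~12.59
pchisq(15.05, df = 6, lower.tail = FALSE)   # P-value for the calculated chi-square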

This type of problem can most easily be solved using a table format:

Day         Observed   Expected   (O − E)²/E
Sunday          33      49.863       5.70
Monday          41      49.863       1.58
Tuesday         63      49.863       3.46
Wednesday       63      49.863       3.46
Thursday        47      49.863       0.16
Friday          56      50.822       0.53
Saturday        47      49.863       0.16
Total          350     350.000      15.05

Assuming equal probabilities, this can be done very easily in R using chisq.test:

> births<-c(33,41,63,63,47,56,47)
> chisq.test(births)

        Chi-squared test for given probabilities

data:  births
X-squared = 15.24, df = 6, p-value = 0.01847

How can we do this with the unequal probabilities that we have? This is a bit more complicated, but still straightforward:

> obsbirths<-births
> days<-c(52,52,52,52,52,53,52)
> expbirths<-350*(days/365)
> expbirths
[1] 49.86301 49.86301 49.86301 49.86301 49.86301 50.82192
[7] 49.86301
> chi<-sum((obsbirths-expbirths)^2/expbirths)
> chi
[1] 15.05676
> ?pchisq
> pchisq(chi,df=6)
[1] 0.9801802
> pchisq(chi,df=6,lower.tail=FALSE)
[1] 0.01981982

What's going on here? By default pchisq() returns the lower-tail (cumulative) probability; the P-value we want is the probability of a value at least as large as the observed statistic, so we use lower.tail=FALSE (equivalently, 1 minus the lower-tail probability).
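The same unequal-probability test can also be run in one step by passing the null proportions to chisq.test() through its p argument; a brief sketch (it should reproduce the chi-square value and P-value calculated above):

chisq.test(births, p = days / 365)   # p must sum to 1; days/365 does exactly that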

Chi-square Distribution

The chi-square distribution is a theoretical probability distribution (analogous to the normal, binomial, Poisson, etc.). Note that the distribution is not symmetrical and is highly skewed; when df = 1 it is asymptotic to both axes!

If χ² is a random variable with a chi-square distribution:
- χ² is a positive real number
- the density function depends only on n (the degrees of freedom)
- the expected value of χ² is n
- the variance of χ² is 2n
- the graph of f(χ²) is not symmetrical
- the graph of f(χ²) approaches symmetry as n becomes large

[Figure: the chi-square distribution]
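These properties are easy to confirm numerically; a small sketch using simulated chi-square variates (the df values are chosen arbitrarily for illustration):

# Compare sample mean/variance of simulated chi-square variates with n and 2n
set.seed(1)
for (n in c(1, 6, 30)) {
  x <- rchisq(100000, df = n)
  cat("df =", n, ": mean ~", round(mean(x), 2), ", variance ~", round(var(x), 2), "\n")
}
# The density is right-skewed but approaches symmetry as df increases
curve(dchisq(x, df = 6), from = 0, to = 30, ylab = "f(x)")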

We can explore the properties of the chi-square distribution through the use of R functions and graphics:

> par(mfrow=c(2,2),mar=c(3,4,3,3))
> layout.show(4)
> plot(dchisq(1,df=1:30))
> plot(dchisq(5,df=1:30))
> plot(dchisq(10,df=1:30))
> plot(dchisq(15,df=1:30))

(Each panel plots the chi-square density evaluated at a fixed value (1, 5, 10, or 15) across df = 1 to 30.)

Chi-square Assumptions

The sampling distribution of the chi-square statistic only approximately follows the chi-square distribution (but quite closely). Two assumptions apply:
1) None of the categories should have an expected frequency less than one.
2) No more than 25% of the categories should have expected frequencies less than five.
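For the births example, these two rules of thumb are easy to check in R (a small sketch using the expbirths vector computed earlier):

any(expbirths < 1)            # FALSE: no expected frequency is below 1
mean(expbirths < 5) > 0.25    # FALSE: no more than 25% of categories are below 5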

Goodness-of-Fit Test - Two Proportions -

The chi-square goodness-of-fit test is a very general one and can be used in a variety of situations. It can also be used when there are only two proportions, as a replacement for the binomial test, but at a cost: it is much less powerful in this situation. So, use the binomial test whenever appropriate.

The Poisson Distribution

The Poisson distribution describes the number of successes in blocks of time or space, when successes happen independently of each other and occur with equal probability at every point in time or space. The Poisson is often useful in biological studies because it is a starting place for evaluating whether an observed pattern is random. If the null model is rejected, the distribution may be either clumped or dispersed.

A clumped distribution arises when the presence of one success increases the probability of success for adjacent observations (e.g., occurrences of a contagious disease). A dispersed distribution is the opposite: the presence of one success decreases the probability of success for adjacent observations (e.g., animals with well-defended territories).
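A small illustrative sketch (my own, not from the slides) of what random, clumped, and dispersed counts look like, using the variance-to-mean ratio as a quick diagnostic (about 1 for random/Poisson counts, greater than 1 for clumped, less than 1 for dispersed); the simulated distributions are arbitrary choices:

# Variance-to-mean ratio for simulated random, clumped, and dispersed counts
set.seed(2)
random    <- rpois(1000, lambda = 4)             # random (Poisson) counts
clumped   <- rnbinom(1000, mu = 4, size = 1)     # negative binomial: clumped
dispersed <- rbinom(1000, size = 8, prob = 0.5)  # binomial: under-dispersed
sapply(list(random = random, clumped = clumped, dispersed = dispersed),
       function(x) var(x) / mean(x))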

The Poisson distribution is constructed using the probability of X successes occurring in any given block of time or space:

Pr[X successes] = e^(−μ) μ^X / X!

where μ is the mean number of independent successes per block of time or space (expressed as a count) and e is the base of the natural logarithm.

- Example -

Example 8.6 provides an assessment of the fossil record. The question is: do extinctions occur randomly through the fossil record, or are there periods where extinction rates are unusually high (mass extinctions) compared to background rates? Fossil marine invertebrates are ideal taxa for testing this question, as they preserve well. The data are the number of recorded extinctions in 76 contiguous blocks of time.

- Example -

The hypotheses are:

H0: The number of extinctions per time interval has a Poisson distribution.
HA: The number of extinctions per time interval does not have a Poisson distribution.

We need to begin by estimating μ, the mean number of extinctions per time interval. As usual, μ can be estimated by x-bar (= 4.21, n = 76). We need to use the same protocol and generate expected values to compare to our observed values, so we return to the formula for the Poisson distribution.

- Example -

For example, for 3 extinctions:

Pr[3 extinctions] = e^(−4.21) × 4.21³ / 3! = 0.1846

Expected[3 extinctions] = 76 × 0.1846 = 14.03

Now, expand for all categories...
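A quick check of this worked value in R (dpois implements the same formula):

exp(-4.21) * 4.21^3 / factorial(3)   # ~0.1846, from the formula
dpois(3, lambda = 4.21)              # same value from R's Poisson density
76 * dpois(3, lambda = 4.21)         # expected frequency, ~14.03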

- Example -

We now have a calculated chi-square test statistic. We need to determine the degrees of freedom. In the broadest sense, df is normally n − 1 (one less than the number of categories). However, in a variety of circumstances we also need to subtract the number of parameters estimated from the data (here, μ). So, df = 8 − 1 − 1 = 6. The critical value for χ² at P = 0.05 and df = 6 is 12.59. The test statistic exceeds this, so we reject the null hypothesis and conclude that extinctions are non-random.

> extinctions<-c(0,13,15,16,7,10,4,2,1,2,6)
> ?dpois
> dpois(extinctions, 4.21)
[1] 1.484637e-02 3.111768e-04 2.626347e-05 6.910575e-06
[5] 6.905011e-02 7.156129e-03 1.943289e-01 1.315693e-01
[9] 6.250321e-02 1.315693e-01 1.148102e-01
> hist(dpois(extinctions, 4.21))

(Here dpois returns the Poisson probability of each value supplied in the extinctions vector.)

- Example -

> extinctions2<-c(13,15,16,7,10,4,2,9)
> chisq.test(extinctions2)

        Chi-squared test for given probabilities

data:  extinctions2
X-squared = 18.7368, df = 7, p-value = 0.009053

(Note that chisq.test, by default, tests the grouped counts against equal probabilities and therefore uses df = 7.)

We can explore the properties of the Poisson distribution through the use of R functions and graphics:

> par(mfrow=c(2,2),mar=c(3,4,3,3))
> layout.show(4)
> plot(dpois(1:25,1))
> plot(dpois(1:25,2))
> plot(dpois(1:25,4.21))   # our example
> plot(dpois(1:25,10))
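For completeness, here is a sketch (my own, not from the slides) of how the Poisson-based expected frequencies described above could be used directly, assuming extinctions2 groups the counts as <=1, 2, 3, 4, 5, 6, 7, and >=8 extinctions (consistent with the two vectors shown above); the df is adjusted by hand for the estimated μ:

# Poisson goodness-of-fit by hand for the grouped categories (assumed grouping)
mu   <- 4.21
obs  <- c(13, 15, 16, 7, 10, 4, 2, 9)
p    <- c(ppois(1, mu), dpois(2:7, mu), ppois(7, mu, lower.tail = FALSE))  # sums to 1
expd <- 76 * p                                  # expected frequencies under the Poisson model
chi  <- sum((obs - expd)^2 / expd)              # chi-square statistic
pchisq(chi, df = length(obs) - 1 - 1, lower.tail = FALSE)   # df = 8 - 1 - 1 = 6

Doing the calculation by hand, rather than via chisq.test with its p argument, makes it easy to use df = 6, accounting for the estimated parameter as in the lecture.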