Chapter 7. Categorical Data Analysis

In Chapter 5 we studied how to test hypotheses involving a single population, such as $H_0: \mu = 5$ vs. $H_a: \mu > 5$. In Chapter 6 we studied how to test hypotheses involving two or more populations, such as $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 > \mu_2$. In both chapters we were dealing with quantitative variables, such as the height of a person or the lengths of alligators. In this chapter we will learn how to test hypotheses that involve qualitative, or categorical, variables.

Recall some examples of categorical variables: Gender, Religion, Race, Color and so on. They are considered categorical because their values are not numeric. Even if we represent the values of a categorical variable with numbers, it is still a categorical variable. For example, we can represent colors using numbers such as 1 for white, 2 for red, 3 for blue and so on, but that does not make the variable quantitative, because you couldn't add 2 and 3 (red and blue) and hope that 5 represents purple. The number 5 probably already represents some other color. The numbers in this color example have no inherent numerical properties; they are simply labels.

When dealing with categorical variables, the closest thing to something numerical is the frequency data. For example, let's say I observe the color of the cars passing by my window (assuming I can see a road from my window with cars passing by). Suppose I collect the following data on the first twelve cars that I see: White, Red, Red, Black, Blue, Green, Yellow, Red, White, Black, Blue and White. I can translate this data into a frequency table like this:

Color of the vehicle    Frequency
White                   3
Black                   2
Blue                    2
Yellow                  1
Red                     3
Green                   1

It is relatively easy to obtain such frequency tables for data involving categorical variables, as the above example shows. Using frequency data, we can test a variety of new types of hypotheses that we have not seen in previous chapters.
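The tallying step above can also be done programmatically. The following is a minimal Python sketch, not part of the chapter's Excel workflow; the variable names observed_colors and frequency_table are made up for this illustration.

```python
from collections import Counter

# The twelve observed car colors, in the order listed in the text.
observed_colors = [
    "White", "Red", "Red", "Black", "Blue", "Green",
    "Yellow", "Red", "White", "Black", "Blue", "White",
]

# Counter maps each category (color) to the number of times it occurs.
frequency_table = Counter(observed_colors)

# Print one row per color, most frequent first.
for color, count in frequency_table.most_common():
    print(f"{color:<8}{count}")
```

The counts are the only numerical information in the data; the color labels themselves are never added or averaged.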
For example, a favorite family game on a long drive on an interstate highway is for each contestant to pick a color, say white or red, and see who counts the most cars of that color before you reach your destination. The person who picked the color with the most cars wins this extremely delightful and colorful game. Suppose on one long journey while playing this game my family collected the following data:

Color of the vehicle    Frequency
White                   44
Red                     36
All the Rest            120

Data such as in the above table can be used to test hypotheses about proportions. For example, say I have a hypothesis that 25% of all cars produced are White, another 25% are Red and the remaining 50% are all other colors combined. In symbols, this hypothesis can be written as:

$H_0: p_{white} = 0.25,\ p_{red} = 0.25,\ p_{other} = 0.50$
$H_a$: at least one of the proportions is different from that specified in the null hypothesis

Hypotheses such as this cannot be tested using any of the methods that we have studied so far. For example, we couldn't use the z test, the t test or the F test to test such a hypothesis. Testing this type of hypothesis requires a new type of test called the chi-square test, where chi is pronounced as in the words kind or kite and not like chime. So the bad news is that we will have to learn a new type of Excel function (or in the olden days, a new statistical table), but the good news is that the whole hypothesis testing procedure remains the same. We will still have a required or desired significance level (alpha), a test statistic, a critical value, a rejection region, a p-value, a decision and a conclusion. The rules of rejection remain the same.

To obtain the test statistic value, we use a formula which I will give you shortly. But before I give you the formula, I must tell you that we will need another column of values. To the above table of data, we will add a column for the expected frequency if the null hypothesis were true.

Color of the Car    Observed Frequency (O)    Expected Frequency if the Null Hypothesis were true (E)
White               44                        50
Red                 36                        50
All the Rest        120                       100
Total               200                       200

Note that in this table we have changed the label of the second column to Observed Frequency. Please verify that the new column has values that represent the hypothesized proportions. For example, since the hypothesized proportion of White cars was 25%, the expected frequency is 50, which is 25% of 200.

Now let me give you the formula for the test statistic value for this type of hypothesis:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the Observed Frequency, E is the Expected Frequency, and the sum is taken over all the categories.

How to calculate the Chi-Square test statistic?
We will use the above example to illustrate how to compute the test statistic:

Color of the Car    O      E      O-E    (O-E)^2    (O-E)^2/E
White               44     50     -6     36         0.72
Red                 36     50     -14    196        3.92
All the Rest        120    100    20     400        4.00
Total               200    200    0                 8.64

The chi-square test statistic is 8.64.

How to obtain the critical value?

If we had a chi-square table, we could obtain the critical value from the table, but since we are getting so good at using Excel, we will obtain it using the Excel function =CHIINV(). This function takes two parameters: probability and degrees of freedom. The probability is your alpha value (typically 0.05) and the degrees of freedom is the number of groups (in our example, three) minus one. So for our example, we obtain the critical value using the formula =CHIINV(0.05,2), which gives us 5.991465.

So what is the rejection region? Any chi-square value greater than 5.991465 falls in the rejection region.

How to get the p-value?

We get the p-value using the =CHIDIST() function. In our example, it is =CHIDIST(8.64,2) = 0.0133.

Decision Time: Since the chi-square test statistic value is 8.64, which is in the rejection region because it is greater than 5.99, we reject the null. The same decision is reached using the p-value, 0.0133, which is less than the alpha value of 0.05.

Conclusion: There is sufficient evidence, at significance level 0.05, that the proportions of white, red and other cars are other than 0.25, 0.25 and 0.50 respectively.

What if alpha was 0.01?

If alpha was 0.01, the critical value would be =CHIINV(0.01,2) = 9.21034, so the rejection region would be $\chi^2 > 9.21034$. The p-value would still be the same, 0.0133. Using the critical value approach, we fail to reject the null hypothesis since 8.64 is not greater than 9.21034. Using the p-value approach, we also fail to reject the null because 0.0133 is greater than 0.01. Note that using either of the two approaches, the decision should always be the same.

The =CHITEST() function

Excel provides a function called =CHITEST(). Once you have generated the column for the expected frequency, you can use the =CHITEST() function to get the p-value for the test without having to generate the test statistic value, which requires the columns needed to compute $(O-E)^2/E$. For the above example, the following Excel screenshots illustrate the use of the =CHITEST() function:

[Excel screenshots: using the =CHITEST() function]
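The same goodness-of-fit calculation can be sketched outside Excel. The Python below is an illustration under stated assumptions, not the chapter's method: it uses only the standard library and relies on the fact that, for exactly 2 degrees of freedom, the right-tail area of the chi-square distribution is exp(-x/2), so the critical value is -2 ln(alpha). That shortcut does not hold for other degrees of freedom. All variable names are made up for this example.

```python
import math

# Observed counts and hypothesized proportions from the car-color example.
observed = {"White": 44, "Red": 36, "All the Rest": 120}
null_proportions = {"White": 0.25, "Red": 0.25, "All the Rest": 0.50}

total = sum(observed.values())  # 200 cars

# Expected frequency for each category if the null hypothesis were true.
expected = {color: total * p for color, p in null_proportions.items()}

# Test statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)

# Degrees of freedom = number of groups minus one.
degrees_of_freedom = len(observed) - 1
# ASSUMPTION: the closed forms below are valid only when df == 2.
assert degrees_of_freedom == 2

alpha = 0.05
critical_value = -2 * math.log(alpha)   # plays the role of =CHIINV(alpha, 2)
p_value = math.exp(-chi_square / 2)     # plays the role of =CHIDIST(x, 2)
reject_null = chi_square > critical_value

print(f"chi-square     = {chi_square:.2f}")
print(f"critical value = {critical_value:.6f}")
print(f"p-value        = {p_value:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Changing alpha to 0.01 in this sketch reproduces the second scenario discussed above, where the statistic no longer exceeds the critical value.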

Please note that the CHITEST function needs two ranges: the actual frequency range, which is the same as the observed frequency range, and the expected frequency range. Please also note that the value thus obtained (0.0133) is the same value that we obtained earlier using =CHIDIST(8.64,2). Please run the above example in Excel yourself to get a better feel for how to use the CHITEST() function.

Two Categorical Variables

So far in this chapter, we have looked at hypotheses regarding the proportions of certain values of a categorical random variable. In the example that we discussed, the random variable was the color of a vehicle and the hypothesis was about the proportions of vehicles with certain colors. In such hypotheses, we are looking at frequency data for one categorical variable (color of vehicle in our example). What if we have frequency data on two categorical variables? For example, let us look at the following data, which shows the number of wins at home and away for a certain university in various sports over the past five years:

Sport         Wins at Home    Wins Away
Football      23              17
Basketball    39              21
Baseball      29              31
Soccer        19              21

Figure 1: Data for Sports vs. Home Field Advantage

In the above data, there are two categorical variables: Wins (at home or away) and Sport. When we have data like this on two categorical variables, the question that can be asked is whether there is a relationship between the two variables or whether they are independent of each other. For example, we can ask whether home field advantage depends upon the sport or not. This type of test is called the test of independence.

Null Hypothesis: $H_0$: Home Field Advantage and Sport are independent of each other
Alternate Hypothesis: $H_a$: Home Field Advantage depends on the Sport

The chi-square test can be used to test for independence between two variables.
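Before running the formal test, it can help to look at the share of each sport's wins that came at home, since independence would mean these shares are roughly the same across sports. The Python sketch below is only an informal illustration and is not part of the chapter's procedure; the names wins and home_share are made up for this example.

```python
# Wins at home and away for each sport (Figure 1).
wins = {
    "Football": (23, 17),
    "Basketball": (39, 21),
    "Baseball": (29, 31),
    "Soccer": (19, 21),
}

# Share of each sport's wins that came at home.
home_share = {
    sport: home / (home + away) for sport, (home, away) in wins.items()
}

# Share of all wins, across every sport, that came at home.
total_home = sum(home for home, _ in wins.values())
total_wins = sum(home + away for home, away in wins.values())
overall_home_share = total_home / total_wins

for sport, share in home_share.items():
    print(f"{sport:<12}{share:.3f}")
print(f"{'Overall':<12}{overall_home_share:.3f}")
```

The shares differ somewhat from sport to sport; the chi-square test described next judges whether differences of this size could plausibly arise by chance.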

Test Statistic: The formula for the test statistic is the same for two variables as for one variable. It is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where the sum is now taken over all the cells of the table. Just as in the case of one variable, we have to create expected frequencies (E). Generating expected frequencies for two categorical variables involves some extra work, which I will explain next.

How to obtain Expected Frequencies?

a. For each row, find the row sum.
b. For each column, find the column sum.
c. Find the grand sum, i.e. the sum of all the row sums (or the sum of all the column sums).
d. For the i-th row and j-th column, the expected frequency is the row sum of the i-th row times the column sum of the j-th column, divided by the grand sum.

The row sums, column sums and the grand sum are shown in the table in Figure 2.

Sport          Wins at Home    Wins Away    Row Sums
Football       23              17           40
Basketball     39              21           60
Baseball       29              31           60
Soccer         19              21           40
Column Sums    110             90           200

Figure 2: Row Sums, Column Sums and Grand Sum

Please verify the row sums, the column sums and the grand sum in Figure 2. The expected frequencies, computed using the formula in step d above, are given in the table in Figure 3.

Sport          Wins at Home    Wins Away    Row Sums
Football       22              18           40
Basketball     33              27           60
Baseball       33              27           60
Soccer         22              18           40
Column Sums    110             90           200

Figure 3: Expected Frequencies (E)

I will explain a couple of the frequencies in Figure 3; you should verify the rest. The expected frequency for the cell Football and Wins at Home is computed as 40*110/200 = 22. The expected frequency for the cell Baseball and Wins Away is computed as 60*90/200 = 27.

So now we have the observed frequencies and the expected frequencies in Figures 1 and 3 respectively. Next we calculate the chi-square value. For each cell we need to compute $(O-E)^2/E$. The next table shows the values of $(O-E)^2/E$ for each cell.

Sport         Wins at Home    Wins Away
Football      0.05            0.06
Basketball    1.09            1.33
Baseball      0.48            0.59
Soccer        0.41            0.50

Figure 4: $(O-E)^2/E$ for each cell

The sum of all these values gives the chi-square test statistic value = 4.51.

So what should we compare this test value with? In other words, what is the critical value?

The critical value can be determined from the Excel function =CHIINV(alpha, degrees of freedom). Suppose our alpha is 0.05. For a test of independence, the degrees of freedom are given by $(r - 1)(c - 1)$, where r is the number of rows (4 in our example) and c is the number of columns (2 in our example). So $(4 - 1)(2 - 1) = 3 \times 1 = 3$.

Critical Value: =CHIINV(0.05,3) = 7.81473
Rejection region: $\chi^2 > 7.81473$
p-value: given by the Excel function =CHIDIST(4.51,3) = 0.2114

Decision using the critical value approach: We fail to reject the null hypothesis because 4.51 is less than 7.81473.

Decision using the p-value approach: We fail to reject the null because the p-value of 0.2114 is higher than 0.05.

Conclusion: We did not find sufficient evidence, at a significance level of 0.05, that home field advantage depends on the sport.

Can we get the p-value directly using Excel?

Once we compute the expected frequencies (Figure 3), we can compute the p-value without having to calculate the numbers in Figure 4. So we can bypass the calculations of $(O-E)^2/E$. How? Using the function =CHITEST(). In this function, we specify two ranges: the range for the observed frequencies (Figure 1) and the range for the expected frequencies (Figure 3). Suppose the range of data values in Figure 1 is C5:D8 and the range of expected frequencies in Figure 3 is C13:D16 (see Figure 5). Then =CHITEST(C5:D8,C13:D16) will give 0.2114, which is the same p-value we got using =CHIDIST(4.51,3). Figure 6 shows the formulas used in Figure 5. Please try to recreate this example on your computer to get a better sense of how this chi-square test was performed.

Figure 5: Excel Calculations of expected frequencies
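The whole test of independence can also be sketched in Python using only the standard library. This is an illustration under stated assumptions rather than the chapter's Excel procedure: the helper chi_square_right_tail_df3 is a made-up name, its closed form is valid only for 3 degrees of freedom, and the critical value is simply copied from the =CHIINV(0.05,3) result quoted in the text.

```python
import math

# Observed frequencies from Figure 1: one row per sport, columns are
# (wins at home, wins away).
observed = [
    [23, 17],  # Football
    [39, 21],  # Basketball
    [29, 31],  # Baseball
    [19, 21],  # Soccer
]

# Steps a, b, c: row sums, column sums and grand sum (Figure 2).
row_sums = [sum(row) for row in observed]
column_sums = [sum(column) for column in zip(*observed)]
grand_sum = sum(row_sums)

# Step d: expected frequency = row sum * column sum / grand sum (Figure 3).
expected = [[r * c / grand_sum for c in column_sums] for r in row_sums]

# Test statistic: sum of (O - E)^2 / E over every cell (Figure 4).
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

# Degrees of freedom = (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
# ASSUMPTION: the closed-form tail area below holds only when df == 3.
assert degrees_of_freedom == 3


def chi_square_right_tail_df3(x):
    """Right-tail area of the chi-square distribution with 3 df only."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


critical_value = 7.81473  # copied from =CHIINV(0.05, 3) in the text
p_value = chi_square_right_tail_df3(chi_square)
reject_null = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}")
print(f"p-value    = {p_value:.4f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Because the sketch keeps the unrounded test statistic instead of 4.51, its p-value may differ from the text's figure in the last printed digit.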

Figure 6: Excel Formulas for the numbers in Figure 5

Summary of the Chapter

When dealing with categorical variables, certain types of hypotheses can be tested. One type involves a single categorical variable: the hypothesis is about the proportions in which the observations are distributed among the variable's categories. Another type involves two categorical variables: the hypothesis is about whether the two variables are independent of each other or dependent on each other. Both types of hypothesis are tested with a chi-square test. The test statistic for a chi-square test measures how far the observed frequencies are from the frequencies that would be expected if the null hypothesis were true. The higher the value of the test statistic, the stronger the evidence in favor of the alternate hypothesis.