Chapter 4
One-Way ANOVA

Recall this chart that showed how most of our course would be organized:

  Explanatory Variable(s)   Response Variable   Methods
  Categorical               Categorical         Contingency Tables
  Categorical               Quantitative        ANOVA
  Quantitative              Quantitative        Regression
  Quantitative              Categorical         (not discussed)

When our data consists of a quantitative response variable and one or more categorical explanatory variables, we can employ a technique called analysis of variance, abbreviated as ANOVA. The material in this chapter corresponds to the first part of Chapter 14 of the textbook.

Recall that a categorical explanatory variable is also called a factor. In this chapter, we'll study the simplest form of ANOVA, one-way ANOVA, which uses one factor and one response variable. We'll also study a more complicated setup in the next chapter, two-way ANOVA, which uses two factors instead. (In principle, we could do ANOVA with any number of factors, but in practice, people usually stick to one or two.)

4.1 Basics of One-Way ANOVA

Let's start by discussing the way we organize and label the data for one-way ANOVA. We also need to formulate the basic question that we plan to ask.
Setup

Typically, when we think about one-way ANOVA, we think about the factor as dividing the subjects into groups. The goal of our analysis is then to compare the means of the subjects in each group.

Notation

Let g represent the number of groups. Then we'll set things up as follows:

• Let µ_1, µ_2, ..., µ_g represent the true population means of the response variable for the subjects in each group. As usual, these population parameters are what we're really interested in, but we don't know their values.

• We call each observation in the sample Y_ij, where i is a number from 1 to g that identifies the group number, and j identifies the individual within that group. (For example, Y_12 represents the response variable value of the second individual in the first group.)

• We can calculate the sample means for each group, which we'll call Ȳ_1•, Ȳ_2•, ..., Ȳ_g•. We can use these known sample means as estimates of the corresponding unknown population means.

Example 4.1: Suppose we want to see if three McDonald's locations around town tend to put the same amount of fries in a medium order, or if some locations put more fries in the container than others. We take the next 30 days on the calendar and randomly assign 10 days to each of the three locations. On each day, we go to the specified location, order a medium order of fries, take it home, and weigh it to see how many ounces of fries it contains.

The categorical explanatory variable is just which location we went to, and the quantitative response variable is the number of ounces of fries. For each of the three locations (g = 3), the population consists of all medium orders of fries sold at that location, while the sample consists of the orders that we actually got. The population means, which we call µ_1, µ_2, µ_3, represent the average number of ounces of fries in all orders at each location, and these are the quantities we're interested in.
We estimate them using Ȳ_1•, Ȳ_2•, Ȳ_3•, the sample means for each location, which are calculated from the data for our orders. The data is shown in Figure 4.1. ∎
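To make the Y_ij indexing concrete, here is a minimal sketch in Python. The fry weights below are made up for illustration (they are not the actual data from Figure 4.1); the point is just how each observation Y_ij sits inside a group i, and how the group sample means estimate the unknown µ_i.

```python
import numpy as np

# Hypothetical fry weights (ounces). data[i][j-1] plays the role of Y_ij:
# observation j within group (location) i.
data = {
    1: np.array([4.1, 4.5, 4.3, 4.4]),
    2: np.array([4.0, 3.9, 4.2, 4.1]),
    3: np.array([3.6, 3.8, 3.5, 3.7]),
}

# Sample mean of each group: our estimate of the unknown population mean mu_i.
group_means = {i: y.mean() for i, y in data.items()}
print(group_means)
```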
Figure 4.1: Ounces of fries in 10 medium orders of fries at each of three McDonald's locations (table of the data, with the mean and standard deviation for each location).

Question of Interest

What we really want to know is whether all of the groups have the same population mean, that is, whether µ_1, µ_2, ..., µ_g are all the same. This is equivalent to asking whether or not the response variable depends on the factor.

Intuitively speaking, the most obvious way to answer this question is by looking at Ȳ_1•, Ȳ_2•, ..., Ȳ_g•, the sample means of the various groups. If they are close enough to each other, in some sense, then we're willing to believe that all the true population means µ_1, µ_2, ..., µ_g are the same. If one or more of Ȳ_1•, Ȳ_2•, ..., Ȳ_g• are too far from the others, then that convinces us that the true population means must not all be the same. All that remains is to figure out what we mean by "close enough" and "too far". We'll eventually see how to do this with a hypothesis test.

One-Way ANOVA Table

ANOVA gets its name (analysis of variance) from the fact that it examines different kinds of variability in the data. It then uses this information to construct a hypothesis test. To describe these different kinds of variability, we'll first need to introduce some more notation:
• Ȳ_•• represents the overall sample mean of all the data from all groups combined.

• N is the total number of observations, and n_i is the number of observations in the ith group. (So n_1 + n_2 + ... + n_g = N.)

Sums of Squares

The most basic quantities that ANOVA uses to describe different kinds of variability are the sums of squares, abbreviated SS. One-way ANOVA involves three sums of squares:

• The total sum of squares, SS_Tot, measures the overall variability in the data by looking at how the Y_ij values vary around Ȳ_••, their overall mean. Its formula is

  SS_{Tot} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\bullet\bullet})^2.

  It can be seen from the formula that \sqrt{SS_{Tot}/(N-1)} is what we would get if we lumped all N observations together, ignoring groups, and calculated the sample standard deviation.

• The group sum of squares, SS_G, measures the variability between the groups by looking at how the sample means for each group, Ȳ_i•, vary around Ȳ_••, the overall mean. Its formula is

  SS_{G} = \sum_{i=1}^{g} n_i (\bar{Y}_{i\bullet} - \bar{Y}_{\bullet\bullet})^2.

• The error sum of squares, SS_E, measures the variability within the groups by looking at how each Y_ij value varies around Ȳ_i•, the sample mean for its group. Its formula is

  SS_{E} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\bullet})^2.

  If we call the sample standard deviation within each group s_i, then another formula for SS_E is

  SS_{E} = \sum_{i=1}^{g} (n_i - 1) s_i^2.
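The three formulas translate directly into code. A sketch (Python, using made-up grouped data rather than the Figure 4.1 values) that computes all three sums of squares and checks the facts stated above:

```python
import numpy as np

# Hypothetical grouped data (not the actual fry data).
groups = [
    np.array([4.1, 4.5, 4.3, 4.4]),
    np.array([4.0, 3.9, 4.2, 4.1]),
    np.array([3.6, 3.8, 3.5, 3.7]),
]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()   # Y-bar(..), the overall sample mean
N = len(all_obs)

# SS_Tot: squared deviations of every observation from the grand mean.
ss_tot = np.sum((all_obs - grand_mean) ** 2)

# SS_G: squared deviations of each group mean from the grand mean, weighted by n_i.
ss_g = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)

# SS_E: squared deviations of each observation from its own group mean.
ss_e = sum(np.sum((y - y.mean()) ** 2) for y in groups)

print(ss_tot, ss_g + ss_e)   # these two agree: SS_Tot = SS_G + SS_E
```

Note that `np.sqrt(ss_tot / (N - 1))` reproduces the sample standard deviation of all N observations lumped together, as remarked above.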
It turns out to be true that

  SS_{Tot} = SS_{G} + SS_{E}.

In words, the total variability equals the sum of the variability between groups and the variability within groups.

Degrees of Freedom

The sums of squares are supposed to measure different kinds of variability in the data, but they also tend to be influenced in various ways by the number of groups g and the number of observations N. This influence is measured by quantities called degrees of freedom that are associated with each sum of squares. Their formulas are

  df_{Tot} = N - 1,  df_{G} = g - 1,  df_{E} = N - g.

Notice that df_Tot = df_G + df_E. The group and error degrees of freedom add to the total, just like the sums of squares do.

Mean Squares

The mean squares are just the sums of squares divided by their degrees of freedom:

  MS_{G} = SS_{G} / df_{G},  MS_{E} = SS_{E} / df_{E}.

(We seldom bother calculating MS_Tot, because it's just the square of the sample standard deviation of all N observations lumped together.) MS_G and MS_E measure the variability between groups and within groups in a way that properly accounts for g and N, unlike SS_G and SS_E.

Table

We typically summarize all this information in an ANOVA table. An ANOVA table for one-way ANOVA is laid out as shown in Figure 4.2. (A few other quantities that we'll calculate later are also sometimes included as extra columns on the right side of the ANOVA table.)

Example 4.2: The ANOVA table for the data shown in Figure 4.1 would obviously be very tedious to calculate by hand, so we use computer software to calculate the ANOVA table shown in Figure 4.3. ∎
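Putting the pieces together, here is a sketch (Python, hypothetical data again) that assembles the quantities of a one-way ANOVA table: sums of squares, degrees of freedom, and mean squares.

```python
import numpy as np

groups = [
    np.array([4.1, 4.5, 4.3, 4.4]),
    np.array([4.0, 3.9, 4.2, 4.1]),
    np.array([3.6, 3.8, 3.5, 3.7]),
]
g = len(groups)
N = sum(len(y) for y in groups)
grand_mean = np.concatenate(groups).mean()

ss_g = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
ss_e = sum(np.sum((y - y.mean()) ** 2) for y in groups)

df_g, df_e, df_tot = g - 1, N - g, N - 1   # df_G + df_E = df_Tot

ms_g = ss_g / df_g   # variability between groups
ms_e = ss_e / df_e   # variability within groups

print(f"Group: df={df_g}, SS={ss_g:.4f}, MS={ms_g:.4f}")
print(f"Error: df={df_e}, SS={ss_e:.4f}, MS={ms_e:.4f}")
print(f"Total: df={df_tot}, SS={ss_g + ss_e:.4f}")
```

In practice, statistical software produces this whole table in one call, but the arithmetic behind it is exactly this.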
  Source   df       SS       MS
  Group    df_G     SS_G     MS_G
  Error    df_E     SS_E     MS_E
  Total    df_Tot   SS_Tot

Figure 4.2: Generic one-way ANOVA table.

Figure 4.3: ANOVA table for the data in Figure 4.1 (numeric df, SS, and MS values for Group, Error, and Total).

4.2 One-Way ANOVA F Test

The focus of ANOVA is a hypothesis test for checking whether all the groups have the same population mean. This is the same as testing whether the response variable depends on the factor. Sometimes we'll refer to this as a test for whether the factor has an "effect" on the response variable (although it may not be right to think about this as a literal cause-and-effect relationship).

One-Way ANOVA F Test Procedure

Like any other hypothesis test, the one-way ANOVA F test consists of the standard five steps.

Assumptions

The one-way ANOVA F test makes four assumptions:

• The data comes from a random sample or randomized experiment. In an observational study, the subjects in each group should be a random sample from that group. In an experiment, the subjects should be randomly assigned to the groups.
• The data for each group should be independent. For example, we wouldn't want to reuse the same subject for measurements in more than one group.

• For each group, the population distribution of the response variable has a normal distribution. To check this assumption, there are a couple of things we should look for:
  - The shape of the data should look at least sort of close to normal.
  - There should be no outliers.

• The population distribution of the response variable has the same standard deviation σ for each group. Of course, we don't know σ, but we can still check this assumption by comparing the sample standard deviations for each group. As an approximate rule of thumb, we typically don't worry unless one group's standard deviation is more than twice as big as another's.

Note: The textbook organizes these four assumptions a little differently. It combines my first two and my last two, and so it lists only two assumptions.

Hypotheses

The null hypothesis for the one-way ANOVA F test is that the factor has no effect, and the alternative is that it does. In terms of parameters, we can write these hypotheses as follows:

  H_0: µ_1, µ_2, ..., µ_g are all equal.
  H_a: µ_1, µ_2, ..., µ_g are not all equal.

Test Statistic

If we're testing whether or not µ_1, µ_2, ..., µ_g are all equal, then it seems reasonable to look at our estimates of those quantities and see if those are all close enough to each other. So we want to look at whether Ȳ_1•, Ȳ_2•, ..., Ȳ_g• are all close enough to each other.

We measure the closeness of the group means using MS_G, the variability between groups. But there's something else we need to consider as
well. Look at the data in Figure 4.4, which shows some hypothetical data comparing gas prices from three different states.

  Left data set:                 Right data set:
  State   GA      AL      FL     State   GA      AL      FL
  Data    $2.06   $2.15   $2.25  Data    $2.37   $2.42   $2.07
          $2.05   $2.16   $2.24          $1.73   $2.02   $1.83
          $2.04   $2.15   $2.26          $1.97   $2.18   $2.47
          $2.05   $2.14   $2.25          $2.13   $1.78   $2.23
  Mean    $2.05   $2.15   $2.25  Mean    $2.05   $2.15   $2.25

Figure 4.4: Two hypothetical data sets for a study of gas prices.

Notice that the sample mean for each group (state) is the same for both data sets, so the variability between groups, MS_G, is the same as well. However, common sense says that the data set on the left is much more convincing that there is an actual difference from group to group. Mathematically, this is because the data set on the left has less variability within groups, which we measure with MS_E.

Our test statistic compares the variability between groups to the variability within groups by taking a ratio:

  F = MS_{G} / MS_{E}.

When MS_G is large compared to MS_E, like the hypothetical data set on the left, F will be large. So larger F values represent more evidence that there is a difference between the group population means; in other words, more evidence against H_0 and in favor of H_a.
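We can check this intuition numerically. The sketch below runs scipy's one-way ANOVA (whose statistic is exactly the ratio MS_G/MS_E) on the two data sets from Figure 4.4: the tight left data set should give a huge F, the noisy right one a small F.

```python
from scipy.stats import f_oneway

# Left data set from Figure 4.4: very little variability within groups.
left = {
    "GA": [2.06, 2.05, 2.04, 2.05],
    "AL": [2.15, 2.16, 2.15, 2.14],
    "FL": [2.25, 2.24, 2.26, 2.25],
}
# Right data set: similar group means, far more spread within groups.
right = {
    "GA": [2.37, 1.73, 1.97, 2.13],
    "AL": [2.42, 2.02, 2.18, 1.78],
    "FL": [2.07, 1.83, 2.47, 2.23],
}

f_left, p_left = f_oneway(*left.values())
f_right, p_right = f_oneway(*right.values())

# The left data set's tiny MS_E makes its F enormous; the right one's
# large MS_E drags F down near (or below) 1.
print(f"left:  F = {f_left:.1f}, p = {p_left:.2g}")
print(f"right: F = {f_right:.2f}, p = {p_right:.2g}")
```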
P-Value and the F Distribution

Recall the definition of the p-value: The p-value is the probability of getting a test statistic value at least as extreme as the one observed, if H_0 is true.

Typically the p-value is a tail probability from whatever kind of statistical distribution the test statistic has when H_0 is true. For the one-way ANOVA F test statistic, we call this distribution an F distribution, like the ones shown in Figure 4.5.

Figure 4.5: Density of the F distribution for df_1 = 2, df_2 = 27 (left) and df_1 = 3, df_2 = 40 (right); horizontal axes show the value of F.

An F distribution has the following properties:

• It is skewed right.

• Things with an F distribution can't be negative, so the F distribution has only one tail. (We never need to double any tail probabilities from an F distribution.)

• The center of the F distribution is usually somewhere around 1, or a little less.

• The exact shape of the F distribution is determined by two different degrees of freedom: the numerator degrees of freedom, or df_1, and the denominator degrees of freedom, or df_2.

If H_0 is true, our test statistic, F, has an F distribution with df_1 = df_G and df_2 = df_E. This is easy to remember, since the formula for F is F = MS_G / MS_E, and the numerator and denominator degrees of freedom are just the degrees of freedom associated with the quantities in the numerator and denominator of F.

Remember that we said the larger values of F are the values that are more supportive of H_a. So the p-value is the probability of getting an
F value larger than the one we actually got, if H_0 is true. Since the test statistic F has an F distribution if H_0 is true, this probability is represented by the shaded area in Figure 4.6. To calculate this probability exactly, we typically need statistical software.

Figure 4.6: Tail probability (shaded area) of an F distribution with df_1 = 3.

If we don't have access to statistical software, we often have to use an F table like the one in the back of our textbook to try to figure out the p-value. Ideally, we would go to our F table, find the correct df_1 and df_2, look up our F value, and it would tell us the p-value. Unfortunately, that's way too much information and would require our F table to be dozens of pages long. Instead, a typical F table, like the one in Figure 4.7, works a little differently. For each combination of df_1 and df_2, the table tells us only a single number. That number is the F value corresponding to a p-value of 0.05. We then check whether our observed F test statistic value is larger or smaller than the one listed in the table.

• If our test statistic value is larger than the number in the table, then our p-value is smaller than 0.05.

• If our test statistic value is smaller than the number in the table, then our p-value is larger than 0.05.

We can see that the p-value behaves as it should: Smaller p-values correspond to larger F values, and both correspond to more evidence against H_0 and in support of H_a.

Figure 4.7: Top-left corner of an F table for right-tail probabilities of 0.05, indexed by df_1 and df_2.

Decision

We make a decision the same way we always do for any hypothesis test: by rejecting H_0 if the p-value is less than or equal to α (often 0.05), and failing to reject H_0 if the p-value is greater than α.

Remember that the hypotheses we're testing are

  H_0: µ_1, µ_2, ..., µ_g are all equal.
  H_a: µ_1, µ_2, ..., µ_g are not all equal.

So let's think about what our decision really represents.

• If we reject H_0, then we're concluding that at least some of the group population means are different.

• If we fail to reject H_0, then we're concluding that it's reasonable that all the group population means are the same.

Example 4.3: Let's go through the five steps of the one-way ANOVA F test for the data in Example 4.2 using α = 0.05.

1. Let's check each of the four assumptions.

• Each day was randomly assigned to a particular location, so this is a randomized experiment.

• The different groups correspond to different locations, each of which should have no ability to affect the measurements of the other two, so the groups are independent.
• It's very hard to tell much about the shape of the data with only 10 observations in each group, but quick dotplots for each group show shapes that are at least somewhat consistent with a normal distribution. Also, we see no outliers in any of the groups.

• The sample standard deviations for the three groups are at least sort of close to each other, so we don't see any violation of the constant standard deviation assumption.

Our assumptions are okay, so we can proceed.

2. The null hypothesis is that µ_1 = µ_2 = µ_3, which means that the three locations, on average, give out the same amount of fries. The alternative hypothesis is that at least one of µ_1, µ_2, µ_3 is not equal to the others, which means that at least one of the locations gives out more or fewer fries than the others.

3. The test statistic F is calculated from the mean squares in the ANOVA table shown in Figure 4.3: F = MS_G / MS_E = 3.55.

4. To calculate our p-value, we compare our observed test statistic value to an F distribution with df_1 = df_G = 2 and df_2 = df_E = 27. When we consult the F table for these df values, the number that it gives us is 3.35. This means that for an F distribution with these degrees of freedom, a test statistic value of 3.35 would correspond to a p-value of 0.05. Our observed test statistic value of 3.55 is larger than 3.35, so our p-value is smaller than 0.05. (We could use statistical software to calculate the exact p-value, which turns out to be about 0.043.)

5. Our p-value is smaller than our α, so we reject H_0. We can conclude that the three locations do not give out the same amount of fries. However, we can't conclude anything about which locations give out more or fewer fries than the others, or about how much more or less they give out. ∎

Figure 4.8 may be helpful for remembering various results and interpretations of a one-way ANOVA F test.
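The p-value computation in step 4 can be reproduced with scipy's F distribution. A sketch, using the F statistic 3.55 and the degrees of freedom from the example above:

```python
from scipy.stats import f

f_stat, df1, df2 = 3.55, 2, 27

# Right-tail probability P(F > 3.55) under H0: the exact p-value.
p_value = f.sf(f_stat, df1, df2)

# The F-table entry: the F value with a 0.05 right-tail probability.
critical = f.ppf(0.95, df1, df2)

print(f"p-value = {p_value:.4f}")          # about 0.043, below alpha = 0.05
print(f"critical value = {critical:.2f}")  # about 3.35, below our F = 3.55
```

Using `f.sf` (the survival function) rather than `1 - f.cdf` avoids losing precision for very small p-values.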
  Large F value                      Small F value
  (much larger than 1)               (around 1 or less than 1)
  Small p-value                      Large p-value
  Evidence against H_0 (for H_a)     No evidence against H_0 (for H_a)
  Reject H_0                         Fail to reject H_0
  Conclude that some population      Reasonable that all population
  group means differ                 group means are the same

Figure 4.8: Results and interpretations of a one-way ANOVA F test.

Alternatives to the One-Way ANOVA F Test

There are some situations in which one-way ANOVA could be used, but another test procedure might be equivalent or preferable.

One-Way ANOVA with Two Groups

When we have only two groups, the one-way ANOVA F test serves exactly the same purpose as the two-sided two-sample t test from Section 10.2, which you saw in your previous course. It turns out that one-way ANOVA with only two groups is completely equivalent to the two-sided two-sample t test, in the sense that both tests will give exactly the same p-value. (This happens because their test statistics are related: F = t².) So in this case, it makes no difference which procedure is used, since both will yield exactly the same conclusion. However, the two-sample t test is slightly more flexible in this case, since it also allows us to use a one-sided alternative hypothesis if we so desire.

Ordinal Variables

If the factor is an ordinal variable, one-way ANOVA makes no use of the ordering information. There exist other test procedures that might make slightly fewer Type II errors than one-way ANOVA by taking into account
the order of the factor categories, but we won't discuss these procedures here.

Normality

One-way ANOVA assumes that the data in each group comes from a normal distribution. Even if the distribution is somewhat different from normal, one-way ANOVA can still work okay if the sample sizes are large enough. However, when sample sizes are small, one-way ANOVA can be unreliable if the data in one or more of the groups comes from a highly non-normal distribution. There exists a nonparametric equivalent of the one-way ANOVA F test called the Kruskal-Wallis test that uses only the ranks of the data and is okay to use no matter what distribution the data comes from. We won't discuss the details, but Section 15.2 of the textbook gives a brief outline.

Block Designs

Recall from Stats 1 that when we wanted to compare the means of two groups, there were two different procedures:

• The two-sample t test compared groups when the data in one group was independent from the data in the other group.

• The matched-pairs t test compared groups when each observation in one group was paired with a corresponding observation in the other group (such as husbands and wives, or before-and-after measurements).

The one-way ANOVA F test we discussed in this section is the multiple-group analog of the two-sample t test. (That's why they're equivalent when there are only two groups.) As mentioned in the assumptions, it can't be used when the observations in a group correspond to observations in other groups.

There also exists a procedure called a block design that is the multiple-group analog of the matched-pairs t test. It should be used instead of simple one-way ANOVA when each subject is reused for measurements in each group. There are many cases where such a procedure is useful.
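The two-group equivalence mentioned above (F = t², identical p-values) is easy to check numerically. A sketch with made-up normal data:

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=15)   # hypothetical group 1
b = rng.normal(11.0, 2.0, size=15)   # hypothetical group 2

f_stat, p_anova = f_oneway(a, b)
t_stat, p_ttest = ttest_ind(a, b)    # two-sided, pooled-variance t test

# With exactly two groups, F equals t squared and the p-values agree.
print(f"F = {f_stat:.4f}, t^2 = {t_stat**2:.4f}")
print(f"ANOVA p = {p_anova:.4f}, t-test p = {p_ttest:.4f}")
```

Note that the equivalence requires the pooled-variance t test (the default `equal_var=True` in `ttest_ind`), matching ANOVA's equal-standard-deviation assumption.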
Example 4.4: Suppose we want to compare the effectiveness of three kinds of fertilizer for growing corn. We have five plots of land available to use, so we divide each plot into thirds and use one fertilizer on each third. Here the plots of land are the subjects and the fertilizers are the groups. Each subject is being reused for each group, so we can't use the one-way ANOVA procedure we discussed in this section. However, this type of data can be analyzed using a block design. ∎

Unfortunately, we won't have time to discuss block designs in detail in this course. The textbook doesn't discuss them either, so if for some reason you need to learn about them, consult another textbook instead. (I can give you a reference if you're interested.)

4.3 One-Way ANOVA Confidence Intervals

The one-way ANOVA F test allows us to conclude whether or not the population group means are all equal. However, we might also want to say something about what we think the group means actually are, or about which group means are different and by how much. We can answer these questions by constructing confidence intervals. Since there are multiple quantities for which we might want to construct confidence intervals in a one-way ANOVA setup, we need to discuss the right way to do this.

Simultaneous Confidence Intervals

When we construct more than one confidence interval at a time, we have to be careful to maintain our specified overall confidence level. For example, if we're 95% confident in the statement "µ_1 is between 78 and 86," and we're also 95% confident in the statement "µ_2 is between 31 and 39," then we'll (usually) be less than 95% confident in the combined statement "µ_1 is between 78 and 86 and µ_2 is between 31 and 39." When we want to state a certain overall confidence level for several confidence intervals simultaneously, we need to construct simultaneous confidence intervals.
(If we're only interested in setting the confidence level for one confidence interval at a time, then we might call this an individual confidence level, to distinguish it from an overall simultaneous confidence level.)
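A small simulation illustrates why the overall confidence level drops. With two independent 95% t intervals (hypothetical normal data; the means 82 and 35 are made up), both intervals cover their targets only about 0.95² ≈ 90% of the time:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)
mu1, mu2, sigma, n, reps = 82.0, 35.0, 4.0, 10, 5000
t_crit = t.ppf(0.975, n - 1)   # multiplier for an individual 95% t interval

hits1 = np.zeros(reps, dtype=bool)
hits2 = np.zeros(reps, dtype=bool)
for r in range(reps):
    for mu, hits in ((mu1, hits1), (mu2, hits2)):
        y = rng.normal(mu, sigma, n)
        halfwidth = t_crit * y.std(ddof=1) / np.sqrt(n)
        hits[r] = abs(y.mean() - mu) <= halfwidth   # did the CI cover mu?

both = (hits1 & hits2).mean()
print(f"individual coverage ~ {hits1.mean():.3f}, joint coverage ~ {both:.3f}")
```

Each interval covers its own mean about 95% of the time, but the joint coverage lands near 90%, which is exactly the problem simultaneous intervals are designed to fix.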
Multiple Comparison Methods

To construct simultaneous confidence intervals, we have to use something called a multiple comparison method. There are a variety of multiple comparison methods, and the best one to use depends on what kind of confidence intervals we plan to construct. We won't discuss the details here.

Confidence Intervals for Group Means

The most obvious quantities for which we might want to construct confidence intervals are µ_1, ..., µ_g, the population means of the groups. Since we're constructing multiple confidence intervals at once, we'll need to use a multiple comparison procedure. Many different multiple comparison methods exist for this situation, and one of the most commonly used is the Bonferroni method. We'll refer to the intervals it produces as Bonferroni simultaneous confidence intervals.

Assumptions

The assumptions for constructing confidence intervals for group means are the same as those for the one-way ANOVA F test.

Estimating the Standard Deviation

Recall that one of our assumptions is that each group has the same population standard deviation, which we call σ. We can estimate σ using

  \hat{\sigma} = \sqrt{MS_{E}}.

This quantity will show up in the confidence interval formula, but it might also be useful in its own right.

Example 4.5: In Example 4.2, we calculated MS_E in the ANOVA table shown in Figure 4.3. Hence our estimate for the population standard deviation σ of each group is σ̂ = √MS_E, the square root of that table entry. ∎

Formula

To construct a set of Bonferroni simultaneous confidence intervals for µ_1, ..., µ_g, we can use the following formula for each µ_i:

  \bar{Y}_{i\bullet} \pm t \, \hat{\sigma} \sqrt{1/n_i},
where t is a number that depends on the confidence level, N, and g. We won't discuss the details of how to get t in this chapter, but we may come back to it later (the Bonferroni method will come up again in a later chapter).

Example 4.6: For Example 4.2, simultaneous 95% Bonferroni confidence intervals for the three group means, as calculated by statistical software, are as follows:

  µ_1: (3.85, 4.87)
  µ_2: (3.58, 4.60)
  µ_3: (3.10, 4.12)

Since this is a set of simultaneous confidence intervals, we can say that we're 95% confident that all three parameter values are in their corresponding intervals. ∎

Confidence Intervals for Differences of Group Means

The one-way ANOVA F test only tells us whether there are differences between the groups. It does not give a verdict on which groups are different, or by how much. To figure this out, we can construct confidence intervals to compare each pair of group population means. More specifically, we want to construct simultaneous confidence intervals for µ_i − µ_k for each pair of groups i and k. For example, with three groups, there would be three quantities for which we would want to construct confidence intervals: µ_1 − µ_2, µ_1 − µ_3, and µ_2 − µ_3.

Many different multiple comparison methods exist for this situation, but the best one for our purposes is called the Tukey method. We'll refer to the intervals it produces as Tukey simultaneous confidence intervals.

Assumptions

The assumptions for constructing Tukey simultaneous confidence intervals are exactly the same as those for the one-way ANOVA F test, with one additional requirement: the group sample sizes n_1, n_2, ..., n_g should be at least approximately equal.
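Before moving on to the Tukey formula, here is how the Bonferroni multiplier t from the previous subsection is commonly obtained: split α across the g intervals, so each interval is built at level 1 − α/g using the t distribution with df_E = N − g degrees of freedom. A sketch on made-up data (this is the standard Bonferroni construction, though your software's exact recipe may differ):

```python
import numpy as np
from scipy.stats import t

# Hypothetical data: three groups of five observations each.
groups = [
    np.array([4.1, 4.5, 4.3, 4.4, 4.2]),
    np.array([4.0, 3.9, 4.2, 4.1, 4.0]),
    np.array([3.6, 3.8, 3.5, 3.7, 3.6]),
]
g = len(groups)
N = sum(len(y) for y in groups)
alpha = 0.05

# sigma-hat = sqrt(MS_E), pooled from the within-group variability.
ms_e = sum(np.sum((y - y.mean()) ** 2) for y in groups) / (N - g)
sigma_hat = np.sqrt(ms_e)

# Bonferroni: each interval at level 1 - alpha/g, so the overall
# simultaneous confidence level is at least 1 - alpha.
t_star = t.ppf(1 - alpha / (2 * g), N - g)

for i, y in enumerate(groups, start=1):
    hw = t_star * sigma_hat * np.sqrt(1 / len(y))
    print(f"mu_{i}: ({y.mean() - hw:.3f}, {y.mean() + hw:.3f})")
```

The Bonferroni multiplier is larger than the usual individual-interval multiplier t.ppf(0.975, N − g), which is what widens the intervals enough to hold the overall confidence level.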
Formula

To construct a set of Tukey simultaneous confidence intervals for each pair of groups i and k, we can use the following formula:

  (\bar{Y}_{i\bullet} - \bar{Y}_{k\bullet}) \pm q \, \hat{\sigma} \sqrt{1/n_i + 1/n_k},

where q is a number that depends on the confidence level, N, and g. We won't discuss the details of how to get q, since we would typically use statistical software to calculate it for us.

Interpretation

For each comparison of two groups, we interpret the corresponding Tukey simultaneous confidence interval as follows:

• If the interval contains only positive numbers, then we can conclude that the first of the two population means being compared is bigger than the second.

• If the interval contains only negative numbers, then we can conclude that the first of the two population means being compared is smaller than the second.

• If the interval contains both positive and negative numbers (in other words, if it contains zero), then we can't conclude that either of the two population means being compared is bigger than the other.

Of course, whenever we conclude that one population mean is bigger than another, the interval also gives us an idea of how much bigger.

Example 4.7: For Example 4.2, Tukey simultaneous 95% confidence intervals, as calculated by statistical software, are as follows:

  µ_1 − µ_2: (−0.44, 0.98)
  µ_1 − µ_3: (0.04, 1.46)
  µ_2 − µ_3: (−0.23, 1.19)

So we can't conclude that there's any difference between µ_1 and µ_2 or between µ_2 and µ_3, since both of the corresponding intervals contain both positive and negative numbers. However, we can conclude that µ_1 is bigger than µ_3, since the corresponding interval contains only positive numbers.
In other words, we can conclude that Location 1 gives out more fries than Location 3, but we can't conclude anything about how Location 2 compares to either of them. ∎
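Statistical software obtains q from the studentized range distribution. As a sketch (made-up data, not the fry data), scipy's `studentized_range` supplies the quantile; the q computed below absorbs the usual 1/√2 factor so that it fits the formula (Ȳ_i• − Ȳ_k•) ± q σ̂ √(1/n_i + 1/n_k) given above:

```python
import numpy as np
from itertools import combinations
from scipy.stats import studentized_range

# Hypothetical data: three groups of five observations each.
groups = [
    np.array([4.6, 4.4, 4.5, 4.7, 4.3]),
    np.array([4.1, 4.0, 4.2, 4.1, 3.9]),
    np.array([3.7, 3.6, 3.8, 3.5, 3.7]),
]
g = len(groups)
N = sum(len(y) for y in groups)

# sigma-hat = sqrt(MS_E), pooled within-group standard deviation.
ms_e = sum(np.sum((y - y.mean()) ** 2) for y in groups) / (N - g)
sigma_hat = np.sqrt(ms_e)

# 95% studentized range quantile for g groups and N - g error df,
# scaled by 1/sqrt(2) to match the formula's sqrt(1/n_i + 1/n_k) form.
q = studentized_range.ppf(0.95, g, N - g) / np.sqrt(2)

for i, k in combinations(range(g), 2):
    diff = groups[i].mean() - groups[k].mean()
    hw = q * sigma_hat * np.sqrt(1 / len(groups[i]) + 1 / len(groups[k]))
    print(f"mu_{i+1} - mu_{k+1}: ({diff - hw:.3f}, {diff + hw:.3f})")
```

An interval that excludes zero flags a pair of locations whose population means we can declare different, exactly as in the interpretation rules above.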