Nonparametric and Distribution-Free Statistical Tests


Concepts that you will need to remember from previous chapters

SS_total, SS_group, SS_error: Sums of squares of all scores, of group means, and within groups
MS_group, MS_error: Mean squares for group means and within groups
F statistic: Ratio of MS_group over MS_error
Degrees of freedom: The number of independent pieces of information remaining after estimating one or more parameters
Effect size (d̂): A measure intended to express the size of a treatment in terms that are meaningful to the reader
Eta squared (η²), omega squared (ω²): Correlation-based measures of effect size
Multiple comparisons: Tests on differences between specific group means

In this chapter we are going to change our general approach to hypothesis testing and look at procedures that rely on substituting ranks for raw scores. These are members of the class of nonparametric, or distribution-free, tests. We will first look at the underlying principle that stands behind such tests and then discuss the reasons why one might prefer to use this kind of test. We will see that these tests are a supplement to what we have learned, not a replacement.

Most of the statistical procedures we have discussed in the preceding chapters have involved the estimation of one or more parameters of the distribution of scores in the population(s) from which the data were sampled, and assumptions concerning the shape of that distribution. For example, the t test makes use of the sample variance (s²) as an estimate of the population variance (σ²) and also requires the assumption that the population from which we sampled is normal (or at least that the sampling distribution of the mean is normal). Tests, such as the t test, that involve assumptions either about specific parameters or about the distribution of the population are referred to as parametric tests.

Definition

Parametric tests: Statistical tests that involve assumptions about, or estimation of, population parameters.
Nonparametric tests: Statistical tests that do not rely on parameter estimation or precise distributional assumptions.
Distribution-free tests: Another name for nonparametric tests.

One class of tests, however, places less reliance on parameter estimation and/or distribution assumptions. Such tests usually are referred to as nonparametric tests or distribution-free tests. By and large, if a test is nonparametric it is also distribution-free; in fact, it is the distribution-free nature of the test that is most valuable to us. Although the two names often are used interchangeably, these tests will be referred to here as distribution-free tests.

The argument over the value of distribution-free tests has gone on for many years, and it certainly cannot be resolved in this chapter. Many experimenters feel that, for the vast majority of cases, parametric tests are sufficiently robust (unaffected by violations of assumptions) to make distribution-free tests unnecessary. Others, however, believe just as strongly in the unsuitability of parametric tests and the overwhelming superiority of the distribution-free approach. (Bradley [1968] is a forceful and articulate spokesman for the latter group, even though his book on the subject is over 40 years old.) Regardless of the position you take on this issue, it is important that you are familiar with the most common distribution-free procedures and their underlying rationale. These tests are too prevalent in the experimental literature simply to be ignored.

The major advantage generally attributed to distribution-free tests is also the most obvious: they do not rely on any seriously restrictive assumptions concerning the shape of the sampled population(s). This is not to say that distribution-free tests do not make any distribution assumptions, only that the assumptions they do require are far more general than those required for the parametric tests.

The exact null hypothesis being tested may depend, for example, on whether two populations are symmetric or have a similar shape. None of these tests, however, makes an a priori assumption about the specific shape of the distribution; that is, the validity of the test is not affected by whether the distribution of the variable in the population is normal. A parametric test, on the other hand, usually includes some type of normality assumption; if that assumption is false, the conclusions drawn from the test may be inaccurate.

Another characteristic of distribution-free tests that often acts as an advantage is that many of them, especially the ones discussed in this chapter, are more sensitive to medians than to means. Thus if the nature of your data is such that you are interested primarily in medians, the tests presented here may be particularly useful to you.

Those who favor using parametric tests in every case do not deny that the distribution-free tests are more liberal in the assumptions they require. They do argue, however, that the assumptions normally cited as being required of parametric tests are overly restrictive in practice and that the parametric tests are remarkably unaffected by violations of distribution assumptions. In other words, they argue that the parametric test is still a valid test even if all of its assumptions are not met.

The major disadvantage generally attributed to distribution-free tests is their lower power relative to the corresponding parametric test. In general, when the assumptions of the parametric test are met, the distribution-free test requires more observations than the comparable parametric test for the same level of power. Thus for a given set of data the parametric test is more likely to lead to rejection of a false null hypothesis than is the corresponding distribution-free test. Moreover, even when the distribution assumptions are violated to a moderate degree, the parametric tests are thought to maintain their advantage.

It often is claimed that the distribution-free procedures are particularly useful because of the simplicity of their calculations. However, for an experimenter who has just invested six months collecting data, a difference of five minutes in computation time hardly justifies the use of a less desirable test. Moreover, since most people run their analyses using computer software, the difference in ease of use disappears completely.

There is one other advantage of distribution-free tests. Because many of them rank the raw scores and operate on those ranks, they offer a test of differences in central tendency that is not affected by one or a few very extreme scores (outliers). An extreme score in a set of data actually can make the parametric test less powerful, because it inflates the variance and hence the error term, as well as biasing the mean by shifting it toward the outlier (the latter may increase or decrease the difference between means).

In this chapter we will be concerned with four of the most important distribution-free methods. The first two are analogues of the t test, one for independent samples and one for matched samples. The next two tests are distribution-free analogues of the analysis of variance, the first for k independent groups and the second for k repeated measures.
All these tests are members of a class known as rank-randomization tests because they deal with ranked data and take as the distribution of their test statistic, when the null hypothesis is true, the theoretical distribution of randomly distributed ranks. I'll come back to this idea shortly. Because these tests convert raw data to ranks, the shape of the underlying distribution of scores in the population becomes less important. Thus a set of data that might have come from a normal distribution and a set that might have come from a bimodal distribution both reduce to the same ranks 1, 2, ..., N.

Definition

Rank-randomization tests: A class of nonparametric tests based on the theoretical distribution of randomly assigned ranks.

The use of methods based on ranks is not the only approach when we are concerned about nonnormality, though it is the most common. Wilcox (2003) has an extensive discussion of newer alternative methods (often relying on the trimming of samples), though there is not space to discuss those methods here.

Why do we use ranks?

You might reasonably ask why we would use ranks to run any of the tests in this chapter. There are three good reasons why these tests were designed around the substitution of ranks for raw data. In the first place, ranks can eliminate or reduce the effects of extreme values. The two highest ranks of 20 items will be the values 19 and 20. But the highest raw-score values could be 77 and 78, or 77 and 130. It makes a difference with raw scores, but not with ranks. A second advantage of ranks is that we know certain of their properties, such as that the sum of a set of N ranks is N(N + 1)/2. This greatly simplifies calculations, which was especially important in the days before high-speed computers. The third advantage is that once you have worked out the critical value of the test statistic when you have 8 observations in one group and 13 in another, you never have to solve that problem again. The next time you have 8 scores in one group and 13 in another, converting to ranks will yield the same critical value. With raw scores, however, you would have to set a cutoff for every conceivable collection of 8 scores in one group and 13 in another.

However, while ranks provided an easy solution when we had to do calculations by hand, that advantage is now largely gone. There is a whole set of statistical tests called randomization tests (or sometimes permutation tests) that work by randomizing raw scores. For the Mann-Whitney test we convert to ranks and then ask about all of the possible ways those ranks could have been assigned to groups if the null hypothesis were true. As I said, ranks make it easy to acquire all possible arrangements and identify the 5% most extreme ones. But now we can do exactly the same thing with raw scores. We can write a very simple computer program that randomly assigns scores to groups, calculates some statistic, and then repeats that process 5,000 times or more in a very few seconds. Then we identify the 5% most extreme outcomes, and that gives us our critical value. And if there are too many possible permutations of the raw scores to make the full enumeration practical, we can pick a random 5,000 or 10,000 rearrangements, and that will give us a result that is acceptably close to the result of the full solution.
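To make that idea concrete, here is a minimal sketch of such a randomization test in Python. The scores, the use of a difference in means as the test statistic, and the 5,000 reshuffles are all illustrative choices, not values taken from this chapter:

import random

# Hypothetical scores for two groups of 8 -- purely illustrative.
group1 = [54, 61, 58, 49, 66, 57, 62, 53]
group2 = [71, 68, 75, 80, 64, 77, 70, 73]

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(group1) - mean(group2))

pooled = group1 + group2
n1 = len(group1)
reshuffles = 5000
extreme = 0

for _ in range(reshuffles):
    random.shuffle(pooled)              # randomly reassign scores to groups
    diff = abs(mean(pooled[:n1]) - mean(pooled[n1:]))
    if diff >= observed:                # at least as extreme as what we saw
        extreme += 1

print("approximate p =", extreme / reshuffles)

The proportion of reshuffled arrangements at least as extreme as the one we observed serves directly as an approximate two-tailed probability under the null hypothesis.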

If R. A. Fisher were still around, he would argue that when the randomization of raw scores gives a result that is more than trivially different from the result of a parametric test such as t or F, then it is the t or F that is wrong.

20.1 The Mann-Whitney Test

One of the most common and best known of the distribution-free tests is the Mann-Whitney test for two independent samples. This test often is thought of as the distribution-free analogue of the t test for two independent samples, although it tests a slightly different, and broader, null hypothesis. Its null hypothesis is the hypothesis that the two samples were drawn at random from identical populations (not just populations with the same mean), but it is especially sensitive to population differences in central tendency. Thus rejection of H₀ generally is interpreted to mean that the two distributions had different central tendencies, but it is possible that rejection actually resulted from some other difference between the populations. Notice that when we gain one thing (freedom from assumptions), we pay for it with something else (loss of specificity).

Definition

Mann-Whitney test: A nonparametric test for comparing the central tendency of two independent samples.

The Mann-Whitney test is a variation on a test originally devised by Wilcoxon called the Rank-Sum test. Because Wilcoxon also devised another test, to be discussed in the next section, we will refer to this version as the Mann-Whitney test to avoid confusion. Although the test as devised by Mann and Whitney used a slightly different test statistic, the statistic used in this chapter (the sum of the ranks of the scores in one of the groups) is often advocated because it is much easier to calculate. (In fact, this is the statistic that Wilcoxon uses for his test. So, to be honest, I am calling this the Mann-Whitney test but doing it the way Wilcoxon proposed.) The result is the same, because either way of computing a test statistic would lead to exactly the same conclusion when applied to the same set of data.

The logical basis of the Mann-Whitney test is particularly easy to understand. Assume that we have two independent treatment groups, with n₁ observations in Group 1 and n₂ observations in Group 2.

To make it concrete, assume that there are 8 observations in each group. Further assume that we don't know whether or not the null hypothesis is true, but that the data we obtain show Group 2 outscoring Group 1 by a substantial margin. Now suppose that we rank the data from lowest to highest, without regard to group membership, and the ranks come out like this:

Ranked Scores
Group 1 ranks: 1  2  3  4  5  6  7  8      Sum = 36
Group 2 ranks: 9 10 11 12 13 14 15 16      Sum = 100

Look at that! The lowest 8 ranks ended up in Group 1 and the highest 8 ranks ended up in Group 2. That doesn't look like a very likely event if the two populations don't differ. We could calculate how often such a result would happen if we really need to, and if you are very patient. Although it could be done mathematically, we could do it empirically by taking 16 balls and writing the numbers 1 through 16 on them, corresponding to the 16 ranks. (We don't have to worry about actual scores, because we are going to replace scores with ranks anyway.) Now we will toss all of the balls into a bucket, shake the bucket thoroughly, pull out 8 balls, which will correspond to the ranks for Group 1, record the sum of the numbers on those balls, toss them back into the bucket, shake and draw again, record the sum of the numbers, and continue that process all night. By the next morning we will have drawn an awful lot of samples, and we can look at the values we recorded and make a frequency distribution of them. This will tell us how often we had a sum of the ranks of only 36, how often the sum was 37, how often it was 50, or 60, or 90, or whatever. Now we really are finished. We know that if we just draw ranks out at random, only very rarely will we get a sum as small as 36. (A simple calculation shows that an outcome as extreme as ours would be expected to occur only one time out of 12,870, for a probability of .000078.) If the null hypothesis is really true, then there should be no systematic reason for the first group to have only the lowest ranks. It should have ranks that are about like those of the second group. If the ranks in Group 1 are improbably low, that is evidence against the null hypothesis.

I mentioned above that this is a rank-randomization test, and what we have just done illustrates where the name comes from. We run the test by looking at what would happen if we randomly assigned scores (or actually ranks) to groups, even if we don't actually go through the process of doing the random assignment ourselves.
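Rather than drawing balls all night, we can let a few lines of Python do the work exactly. This sketch enumerates all 12,870 possible draws of 8 of the 16 ranks and counts how often the Group 1 rank sum is 36 or less:

from itertools import combinations

# Every possible way 8 of the ranks 1-16 could fall in Group 1.
sums = [sum(draw) for draw in combinations(range(1, 17), 8)]

total = len(sums)                              # 12,870 possible assignments
as_small = sum(1 for s in sums if s <= 36)     # draws with a sum of 36 or less
print(total, as_small, as_small / total)       # 12870  1  0.0000777...

With groups this small the full enumeration is instant; for larger samples we would fall back on the random-sampling approach sketched earlier.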

Now consider the case in which the null hypothesis is true and the scores for the two groups were sampled from identical populations. In this situation, if we were to rank all N scores without regard to group membership, we would expect some low ranks and some high ranks in each group, and the sum of the ranks assigned to Group 1 would be roughly equal to the sum of the ranks assigned to Group 2. In a reasonable result for the situation with a true null hypothesis, Group 2 scores would not look a lot different from Group 1 scores, and ranking the data across both groups might give a sum of 64 for the ranks in Group 1 and a sum of 72 for the ranks in Group 2. Here the sum of the ranks in Group 1 is not much different from the sum of the ranks in Group 2, and a sum like that would occur quite often if we just drew ranks at random.

Mann and Whitney (and Wilcoxon) based their tests on the logic just described, using the sum of the ranks in one of the groups as the test statistic. If that sum is too small relative to the other sum, we will reject the null hypothesis. More specifically, we will take as our test statistic the sum of the ranks assigned to the smaller group, or, if n₁ = n₂, the smaller of the two sums. Given this value, we can use tables of the Mann-Whitney statistic (W_S) to test the null hypothesis. (They needed to concern themselves with only one of the sums because, with a fixed set of numbers [ranks], the sum of the ranks in one group is directly related to the sum of the ranks in the other group. If one sum is high, the other must be low.)

To take a specific example, consider the data in Table 20.1 on the number of recent stressful life events reported by a group of cardiac patients in a local hospital and a control group of orthopedic patients in the same hospital. It is well known that stressful life events (marriage, a new job, death of a spouse, etc.) are associated with illness, and it is reasonable to expect that many cardiac patients would have experienced more recent stressful events than orthopedic patients (who just happened to break an ankle while tearing down a building or a collarbone while skiing). It would appear from the data that this expectation is borne out. Because we have some reason to suspect that life stress scores probably are not symmetrically distributed in the population (especially for cardiac patients, if our research hypothesis is true), we will choose to use a distribution-free test. In this case we will use the Mann-Whitney test because we have two independent groups.

Table 20.1 Stressful Life Events Reported by Cardiac and Orthopedic Patients
(Data and Ranks columns for the cardiac and orthopedic groups)

To apply the Mann-Whitney test, we first rank all 11 scores from lowest to highest, assigning tied ranks to tied scores. The orthopedic group is the smaller of the two, and if those patients generally have had fewer recent stressful life events, then the sum of the ranks assigned to that group would be relatively low. Letting W_S stand for the sum of the ranks in the smaller group (the orthopedic group), we find

W_S = Σ(R_i in smaller group) = 21

We can evaluate the obtained value of W_S by using Table E.8 in Appendix E, which gives the smallest value of W_S we would expect to obtain by chance if the null hypothesis were true. From Table E.8 we find that for n₁ = 5 subjects in the smaller group and n₂ = 6 subjects in the larger group (n₁ is always used to represent the number of subjects in the smaller group) the entry for α = .025 (one-tailed) is 18. This means that for a difference between groups to be significant at the two-tailed .05 level (or the one-tailed .025 level), W_S must be less than or equal to 18. Because we found W_S to be 21, we cannot reject H₀. (By way of comparison, if we ran a t test on these data, ignoring the fact that one sample variance is almost 50 times the other and that the data suggest that our prediction of the shape of the distribution of cardiac scores may be correct, t would be 1.52 on 9 df, which is also a nonsignificant result.)

As an aside, I should point out that we would have rejected H₀ if our value of W_S had been smaller than the tabled value. Until now you have been rejecting H₀ when the obtained test statistic was larger than the corresponding tabled value. When we work with nonparametric tests, the tables are usually set up to lead to rejection for small obtained values. If I were redesigning statistical procedures, I would set the tables up differently, but nobody asked me. Just get used to the fact that parametric tables are set up such that you reject H₀ for large obtained values, and nonparametric tables are often set up so that you reject for small values. That's just the way it is.
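For readers analyzing data in Python, a sketch of the same test follows. The scores here are placeholders standing in for the Table 20.1 data (the actual values are not reproduced above), so only the mechanics, not the numbers, should be taken from this example. scipy reports the algebraically equivalent Mann-Whitney U statistic rather than W_S:

import numpy as np
from scipy import stats

# Placeholder scores -- NOT the actual Table 20.1 data.
orthopedic = [3, 10, 4, 12, 6]           # smaller group, n1 = 5
cardiac = [15, 42, 8, 25, 31, 27]        # larger group,  n2 = 6

# W_S as this chapter defines it: rank all 11 scores together
# (tied scores get tied ranks), then sum the ranks in the smaller group.
ranks = stats.rankdata(np.concatenate([orthopedic, cardiac]))
W_S = ranks[:len(orthopedic)].sum()
print("W_S =", W_S)

# scipy's version of the same test, reported as U with a p value.
U, p = stats.mannwhitneyu(orthopedic, cardiac, alternative="two-sided")
print("U =", U, "p =", p)

Because U and W_S differ only by a constant that depends on the sample sizes, the two statistics always lead to the same conclusion.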

The entries in Table E.8 are for a one-tailed test and will lead to rejection of the null hypothesis only if the sum of the ranks for the smaller group is sufficiently small. It is possible, however, that the larger ranks could be congregated in the smaller group, in which case, if H₀ is false, the sum of the ranks would be larger than chance expectation rather than smaller. One rather awkward way around this problem would be to rank the data all over again, this time ranking from high to low rather than from low to high. If we did that, the smaller ranks would appear in the smaller group, and we could proceed as before. We do not have to go through the process of reranking the data, however. We can accomplish the same thing by making use of the symmetric properties of the distribution of the rank sum, by calculating a statistic called W'_S. W'_S is the sum of the ranks for the smaller group that we would have found if we had reversed our ranking and ranked from highest to lowest:

W'_S = 2W̄ − W_S

where 2W̄ = n₁(n₁ + n₂ + 1) and is tabled in Table E.8 in Appendix E. For a two-tailed test of H₀ (which is what we normally want) we calculate both W_S and W'_S, enter the table with whichever is smaller, and double the listed value of α.

For an illustration of W_S and W'_S, consider the following two sets of ranks, each with four scores in Group 1 and five in Group 2:

Set 1
Group 1 ranks: 1, 2, 3, 5          W_S = 11     W'_S = 29
Group 2 ranks: 4, 6, 7, 8, 9

Set 2
Group 1 ranks: 5, 7, 8, 9          W_S = 29     W'_S = 11
Group 2 ranks: 1, 2, 3, 4, 6

Notice that the two data sets exhibit the same degree of extremeness, in the sense that for the first set four of the five lowest ranks are in Group 1, and in the second set four of the five highest ranks are in Group 1. Moreover, W_S for Set 1 is equal to W'_S for Set 2, and vice versa. Thus if we establish the rule that we will calculate both W_S and W'_S for the smaller group and refer the smaller of W_S and W'_S to the tables, we will have a two-tailed test and will come to the same conclusion with respect to the two data sets.

The Normal Approximation

Table E.8 in Appendix E is suitable for all cases in which n₁ and n₂ are less than or equal to 25. For larger values of n₁ and/or n₂ we can make use of the fact that the distribution of W_S approaches a normal distribution as sample sizes increase.

This distribution has

Mean = n₁(n₁ + n₂ + 1)/2

and

Standard error = √[n₁n₂(n₁ + n₂ + 1)/12]

Because the distribution is normal and we know its mean and its standard deviation (the standard error), we can calculate z:

z = (Statistic − Mean)/(Standard error)
  = (W_S − n₁(n₁ + n₂ + 1)/2) / √[n₁n₂(n₁ + n₂ + 1)/12]

and obtain from the tables of the normal distribution an approximation of the true probability of a value of W_S at least as low as the one obtained.

To illustrate the computations for the case in which the larger ranks fall into the smaller group, and to illustrate the use of the normal approximation (although we don't really need an approximation for such small sample sizes), consider the data in Table 20.2. These are hypothetical (but reasonable) data on the birthweights (in grams) of children born to mothers who did not seek prenatal care until the third trimester and of children born to mothers who received prenatal care starting in the first trimester.

Table 20.2 Data on Birthweight of Infants Born to Mothers with Different Levels of Prenatal Care
(Birthweight and Rank columns for the Third-Trimester and First-Trimester groups; n₁ = 8, n₂ = 10)

W_S = Σ(ranks in smaller group) = 100
W'_S = 2W̄ − W_S = 152 − 100 = 52
z = (W_S − n₁(n₁ + n₂ + 1)/2) / √[n₁n₂(n₁ + n₂ + 1)/12] = (100 − 76)/11.25 = 2.13

For the data in Table 20.2 the sum of the ranks in the smaller group equals 100. From Table E.8 in Appendix E we find 2W̄ = 152; thus W'_S = 2W̄ − W_S = 152 − 100 = 52. Because 52 is smaller than 100, we go to Table E.8 with W'_S = 52, n₁ = 8, and n₂ = 10. (Remember, n₁ is defined as the smaller sample size.) Because we want a two-tailed test, we will double the column headings for α. The critical value of W_S (or W'_S) for a two-tailed test at α = .05 is 53, meaning that only 5% of the time would we expect a value of W_S or W'_S less than or equal to 53 when H₀ is true. Our obtained value of W'_S is 52, which falls in the rejection region, so we will reject H₀. We will conclude that mothers who do not receive prenatal care until the third trimester tend to give birth to smaller babies. This does not necessarily mean that not having care until the third trimester causes smaller babies, but only that variables associated with delayed care (e.g., young mothers, poor nutrition, and poverty) also are associated with lower birthweight.

The use of the normal approximation for evaluating W_S is illustrated in the lower section of Table 20.2. Here we find that z = 2.13.

From Table E.10 in Appendix E we find that the probability of a W_S or W'_S at least as small as 52 (a z at least as extreme as ±2.13) is .0332. Because this value is smaller than our traditional cutoff of α = .05, we will reject H₀ and again conclude that there is sufficient evidence to say that failing to seek early prenatal care is related to lower birthweight. Note that both the exact solution and the normal approximation lead to the same conclusion with respect to H₀. (With the normal approximation it is not necessary to calculate and use W'_S, because using W'_S would lead to the same value of z except for the reversal of its sign. It would be instructive for you to calculate Student's t test for two independent groups from the same set of data.)
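The normal approximation is only a few lines of arithmetic. This sketch reproduces the Table 20.2 computations (n₁ = 8, n₂ = 10, W_S = 100):

from math import sqrt

n1, n2 = 8, 10       # group sizes from Table 20.2 (n1 = smaller group)
W_S = 100            # sum of the ranks in the smaller group

mean = n1 * (n1 + n2 + 1) / 2              # = 76, expected rank sum under H0
se = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # standard error of the rank sum
z = (W_S - mean) / se
print(round(z, 2))                         # 2.13

W_S_prime = n1 * (n1 + n2 + 1) - W_S       # 2W-bar - W_S = 152 - 100 = 52
print(W_S_prime)

Running the same arithmetic with W_S_prime in place of W_S gives z = −2.13, illustrating the sign reversal noted above.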

The Treatment of Ties

When the data contain tied scores, any test that relies on ranks is likely to be somewhat distorted. There are several different ways of dealing with ties. You can assign tied ranks to tied scores (as we have been doing), you can flip a coin and assign consecutive ranks to tied scores, or you can assign untied ranks in whatever way will make it hardest to reject H₀. In actual practice, most people simply assign tied ranks. Although that may not be the statistically best way to proceed, it is the most common and is the method we will use here.

The Null Hypothesis

The Mann-Whitney test evaluates the null hypothesis that the two sets of scores were sampled from identical populations. This is broader than the null hypothesis tested by the corresponding t test, which dealt specifically with means (primarily as a result of the underlying assumptions that ruled out other sources of difference). If the two populations are assumed to have the same shape and dispersion, then the null hypothesis tested by the Mann-Whitney test would actually deal with the central tendency (in this case the medians) of the two populations; if the populations are also symmetric, the test will be a test of means. In any event, the Mann-Whitney test is particularly sensitive to differences in central tendency.

Using SPSS

I will illustrate the use of SPSS for this test, and it should be clear how it would be used for the tests that follow. In Chapter 17 we considered data collected by Willer (2005) on the Masculine Overcompensation Thesis. Those data can be found on the Web site as Tab17.5.dat. The first column represents Gender (1 = Male), the second column represents Condition (1 = Threat), and the third column contains the dependent variable (Price). In Chapter 17 I mentioned that Willer's data were probably positively skewed, although the data that I created to match his data were more or less normal. This might be a place where the Mann-Whitney test would be useful, especially if we had Willer's actual data. I also noted there that Willer was most interested in males and the hypothesis that when males' masculinity is questioned, they might engage in more masculine behavior, and so we will limit our analysis to males.

To restrict the analysis to data from males, you need to go to the drop-down menu labeled Data, choose Select Cases, and then specify that you want to use only the data from Gender = 1. Next, choose Analyze/Nonparametric tests/2-independent samples. Then specify that Price is the test variable and that Threat is the grouping variable. When you do that you also have to indicate that the levels of Threat are 1 and 2. The results of this analysis appear below.

Mann-Whitney Test

Ranks (abbreviated output): for the variable "Price willing to pay," SPSS reports N, Mean Rank, and Sum of Ranks for the Threatened and Confirmed conditions, with a total N of 50; the smaller sum of ranks was 534.5.

Test Statistics (grouping variable: Condition): Wilcoxon W = 534.5; Asymp. Sig. (2-tailed) = .046.

Here you see that the probability of this result under the null hypothesis is given as .046, which is less than .05 and will lead us to conclude that threatened males do engage in more masculine behavior. (SPSS uses a normal approximation, but if you look at Appendix Table E.8 you will see that the critical sum of ranks is 536. From the printout the smaller sum was 534.5, which also leads to rejection of the null hypothesis.)

20.2 Wilcoxon's Matched-Pairs Signed-Ranks Test

Frank Wilcoxon is credited with developing the most popular distribution-free test for independent groups, which I referred to as the Mann-Whitney test to avoid confusion and because of Mann and Whitney's work on it. He also developed the most popular test for matched groups (or paired scores). This test is the distribution-free analogue of the t test for related samples. It tests the null hypothesis that two related (matched) samples were drawn either from identical populations or from symmetric populations with the same mean. More specifically, it tests the null hypothesis that the distribution of difference scores (in the population) is symmetric about zero. This is the same hypothesis tested by the corresponding t test when that test's normality assumption is met.

The logic behind Wilcoxon's matched-pairs signed-ranks test is straightforward and can be illustrated with an example from a study of schizophrenia and subcortical structures by Suddath, Christison, Torrey, Casanova, and Weinberger (1990). Bleuler (1911) originally described schizophrenia as being characterized by a lack of connections between associations in memory. The hippocampus has been suggested as playing an important role in memory storage and retrieval, and it is reasonable to ask whether differences in hippocampal structures (particularly size) could play a role in schizophrenia.

Suddath obtained MRI scans of the brains of 15 schizophrenic individuals and their monozygotic (identical) twins, and measured the volume of each brain's left hippocampus. Because there are many things that control the volume of cortical and subcortical structures, Suddath used monozygotic twin pairs in an effort to control as many of these as possible and to reduce the amount of variance to be explained. The results appear in Table 20.3, as taken from Ramsey and Schafer (1996).

Definition

Wilcoxon's matched-pairs signed-ranks test: A nonparametric test for comparing the central tendency of two matched (related) samples.

If you plot the difference scores for these 15 twin pairs, as shown in Figure 20.1, you will note that the distribution is far from normal. With so few observations it is not feasible to make a definitive statement about normality, but I would not like to have to defend the idea that these are normally distributed observations. For that reason I would prefer to rely on a distribution-free test for paired observations, and that test is the Wilcoxon matched-pairs signed-ranks test.

Table 20.3 Data on Volume (in cm³) of Left Hippocampus in Schizophrenic and Nonschizophrenic Twin Pairs
(columns: Pair, Normal, Schizophrenic, Difference, Rank, Signed Rank; for these data, T+ (positive ranks) = 111 and T− (negative ranks) = 9)

Figure 20.1 Distribution of differences between schizophrenic and normal twins (histogram of the 15 difference scores; Mean = .20, Std. Dev. = .24)

The test is based, as its name suggests, on the ranks of the differences rather than their numerical values. If schizophrenia is associated with lower (or higher) volume of the left hippocampus, we would expect most of the twin pairs to show a lower (or higher) volume for the schizophrenic twin than for the control twin. Thus we would expect predominantly positive (or negative) differences. We also would expect twin pairs who break this pattern to differ only slightly, in the direction opposite the trend. On the other hand, if schizophrenia has nothing to do with volume, we would expect about one-half of the difference scores to be positive and one-half to be negative, with the positive differences about as large as the negative ones. In other words, if H₀ is really true, we would not expect most differences to be in the predicted direction with only small differences in the unpredicted direction. Notice that I have deliberately phrased this paragraph for a two-tailed (nondirectional) test. For a directional test you would simply remove the phrases in parentheses.

In carrying out the Wilcoxon matched-pairs signed-ranks test we first calculate the difference score for each pair of measurements. We then rank all difference scores without regard to the sign of the difference, give the algebraic sign of the differences to the ranks themselves, and finally sum the positive and negative ranks separately. The data in Table 20.3 present the scores (in cm³) for the 15 schizophrenic participants and their twins in columns two and three. The fourth column shows the differences between the twins, with these differences ranked (without regard to sign) in the fifth column. Although the difference for pair 2 is the smallest (most negative) number in column four and would normally be ranked 1, when we drop its sign and look only at the size of the difference, not its direction, it is the ninth-smallest difference. The last column shows the ranks from column five with the sign of the difference applied.

The test statistic (T) is taken as the smaller of the absolute values of the two sums (i.e., dropping the sign) and is evaluated against Table E.7 in Appendix E. (It is important to note that in calculating T we attach algebraic signs to the ranks only for convenience. We could just as easily, for example, circle those ranks that went with lower volume for the normal twin and underline those that went with higher volume for the normal twin. We are merely trying to differentiate between the two cases.)

For the data in Table 20.3 only one of the pairs had the normal twin with a smaller volume than the schizophrenic twin. Although that difference received the rank of 9, it was still only one case. All other pairs showed a difference in the other direction. The sum of the positive ranks is T+ = 111 and the sum of the negative ranks is T− = 9. Because T is defined as the smaller absolute value of T+ and T−, T = 9.

To evaluate T, we refer to Table E.7, a portion of which is shown in Table 20.4.

Table 20.4 Critical Lower-Tail Values of T and Their Associated Probabilities (Abbreviated Version of Table E.7)
(columns: N, then pairs of critical T values with their exact probabilities at each nominal one-tailed α level)

The format of this table is somewhat different from that of the other tables we have seen. The easiest way to understand what the entries in the table represent is by way of an analogy. Suppose that to test the fairness of a coin you are going to flip it eight times and reject the null hypothesis, at α = .05 (one-tailed), if there are too few heads. Out of eight flips of a coin there is no set of outcomes that has a probability of exactly .05 under H₀. The probability of one or fewer heads is .0352, and the probability of two or fewer heads is .1445. Thus if we want to work at α = .05, we can either reject for one or fewer heads, in which case the probability of a Type I error is actually .0352 (less than .05), or we can reject for two or fewer heads, in which case the probability of a Type I error is actually .1445 (much greater than .05).

Do you see where we are going? The same kind of problem arises with T because it has a discrete distribution. No value has a probability of exactly the desired α. In Table E.7 we find that for a one-tailed test at α = .025 (or a two-tailed test at α = .05) with n = 15 the entries are 25 [.0240] and 26 [.0277]. This tells us that if we want to work at a one-tailed α = .025 (and thus a two-tailed α = .05), we can reject H₀ either for T ≤ 25 (in which case α actually equals .0240) or for T ≤ 26 (in which case the true value of α is .0277). Because we want a two-tailed test, the probabilities should be doubled, to 25 [.0480] and 26 [.0554]. We obtained a T value of 9, so we would reject H₀ whichever cutoff we choose. We will conclude, therefore, that the volume of the left hippocampus is not the same for schizophrenic and normal participants. We can see from the data that the left hippocampus is generally smaller in those suffering from schizophrenia. This is a very important finding, if only in that it demonstrates that there is a physical basis underlying schizophrenia, not simply mistaken ways of living.
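A sketch of the same test in Python follows. The difference scores below are placeholders, not the actual Table 20.3 values; they merely mimic the pattern of 14 positive differences and 1 negative one. For a two-sided test, scipy's wilcoxon reports the smaller rank sum, which is the T defined above:

import numpy as np
from scipy import stats

# Placeholder differences (normal minus schizophrenic twin, cm^3) -- NOT
# the actual Table 20.3 values, just 14 positive and 1 negative difference.
diffs = np.array([0.67, 0.19, 0.45, 0.30, 0.25, 0.13, 0.26, 0.50,
                  -0.20, 0.33, 0.14, 0.42, 0.23, 0.10, 0.60])

# T by hand: rank the absolute differences, reattach the signs,
# and sum the positive and negative ranks separately.
ranks = stats.rankdata(np.abs(diffs))
T_plus = ranks[diffs > 0].sum()
T_minus = ranks[diffs < 0].sum()
print(T_plus, T_minus, min(T_plus, T_minus))

# scipy's statistic is the same smaller rank sum, T, with a p value.
T, p = stats.wilcoxon(diffs)
print(T, p)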

Ties

Ties can occur in the data in two different ways. One way would be for a twin pair to have the same scores for both the normal and the schizophrenic twin, leading to a difference score of zero, which has no sign. In that case we normally eliminate that pair from consideration and reduce the sample size accordingly, although this leads to some bias in the test. We could also have tied difference scores that lead to tied rankings. If both tied scores have the same sign, we can break the tie in any way we want (or assign tied ranks) without affecting the final outcome. If the scores have opposite signs, we normally assign tied ranks and proceed as usual.

The Normal Approximation

Just as with the Mann-Whitney test, when the sample size is too large (in this case, larger than 50, which is the limit for Table E.7), a normal approximation is available to evaluate T. For larger sample sizes we know that the sampling distribution of T is approximately normal with

Mean = n(n + 1)/4

and

Standard error = √[n(n + 1)(2n + 1)/24]

Thus we can calculate z as

z = (T − n(n + 1)/4) / √[n(n + 1)(2n + 1)/24]

and evaluate z using Table E.10. The procedure is directly analogous to that used with the Mann-Whitney test and will not be repeated here.

Frank Wilcoxon (1892–1965)

Frank Wilcoxon is an interesting person in statistics for the simple reason that he was not really a statistician and didn't publish any statistical work until he was in his 50s. He was originally trained in inorganic chemistry and spent most of his life doing chemical research dealing with insecticides and fungicides. Wilcoxon had been in a statistical study group with W. J. Youden, an important early figure in statistics, and they had worked their way through Fisher's very influential text. But when it came to analyzing data in later years, Wilcoxon was not satisfied with Fisher's method of randomization of observations. Wilcoxon hit upon the idea of substituting ranks for raw scores, which allowed him to work out the distribution of various test statistics quite easily. His use of ranks stimulated work on rank-based inference and led to a number of related statistical tests applied to ranks. Wilcoxon officially retired in 1957 but then joined Florida State University and worked on sequential ranking methods until his death. His name is still largely synonymous with rank-based statistics.

20.3 Kruskal-Wallis One-Way Analysis of Variance

The Kruskal-Wallis one-way analysis of variance is a direct generalization of the Mann-Whitney test to the case in which we have three or more independent groups. As such, it is the distribution-free analogue of the one-way analysis of variance discussed in Chapter 16. It tests the hypothesis that all samples were drawn from identical populations and is particularly sensitive to differences in central tendency.

Definition

Kruskal-Wallis one-way analysis of variance: A nonparametric test analogous to a standard one-way analysis of variance.

To perform the Kruskal-Wallis test, we simply rank all scores without regard to group membership and then compute the sum of the ranks for each group. The sums are denoted by R_j. If the null hypothesis were true, we would expect the R_j's to be more or less equal (aside from differences due to the sizes of the samples). A measure of the degree to which the R_j's differ from one another is provided by

H = [12 / (N(N + 1))] Σ (R_j² / n_j) − 3(N + 1)

where

n_j = the number of observations in the jth group
R_j = the sum of the ranks in the jth group
N = Σ n_j = total sample size

and the summation is taken over all k groups. H is then evaluated against the χ² distribution on k − 1 df.

Students frequently have problems with a statement such as "H is then evaluated against the χ² distribution on k − 1 df." All that it really means is that we treat H as if it were a value of χ² and look it up in the chi-square tables on k − 1 df.

For an example, assume that the data in Table 20.5 represent the number of simple arithmetic problems (out of 85) solved (correctly or incorrectly) in one hour by participants given a depressant drug, a stimulant drug, or a placebo. Notice that in the Depressant group three of the participants were too depressed to do much of anything, and in the Stimulant group three of the participants ran up against the limit of 85 available problems. These data are decidedly nonnormal, so we will convert the data to ranks and use the Kruskal-Wallis test. The calculations are shown in the lower part of the table.

Table 20.5 Kruskal-Wallis Test Applied to Data on Problem Solving
(columns: Score and Rank for the Depressant, Stimulant, and Placebo groups, with the sum of the ranks, R_i, for each group; the lower part of the table computes H = [12/(N(N + 1))] Σ (R_i²/n_i) − 3(N + 1) = 10.36)

The obtained value of H is 10.36, which can be treated as a χ² on k − 1 = 2 df. The critical value of χ² on 2 df is found in Table E.1 in the Appendices to be 5.99. Because 10.36 > 5.99, we can reject H₀ and conclude that the three drugs lead to different rates of performance. (Like other chi-square tests, this test rejects H₀ for large values of H. It is nonetheless a nondirectional test.)
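As a sketch of how this test might be run in Python (the scores are placeholders, not the actual Table 20.5 data, although they mimic its pattern of floor and ceiling scores):

from scipy import stats

# Placeholder problem counts -- NOT the actual Table 20.5 data.
depressant = [4, 19, 20, 2, 35, 1, 3]
stimulant = [85, 85, 85, 60, 72, 79, 68]
placebo = [41, 38, 50, 55, 46, 44, 49]

# scipy ranks all scores across the groups and computes H
# (with a correction for tied ranks), plus a chi-square-based p value.
H, p = stats.kruskal(depressant, stimulant, placebo)
print(H, p)    # compare H to chi-square on k - 1 = 2 df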

20.4 Friedman's Rank Test for k Correlated Samples

The last test to be discussed in this chapter is the distribution-free analogue of the one-way repeated-measures analysis of variance: Friedman's rank test for k correlated samples. It was developed by the well-known economist Milton Friedman, in the days before he was a well-known economist. This test is closely related to a standard repeated-measures analysis of variance applied to ranks instead of raw scores. It is a test on the null hypothesis that the scores for each treatment were drawn from identical populations, and it is especially sensitive to population differences in central tendency.

Definition

Friedman's rank test for k correlated samples: A nonparametric test analogous to a standard one-way repeated-measures analysis of variance.

We will base our example on a study by Foertsch and Gernsbacher (1997), who investigated the substitution of the genderless word "they" for "he" or "she." With the decrease in the acceptance of the word "he" as a gender-neutral pronoun, many writers are using the grammatically incorrect "they" in its place. (You may have noticed that in this text I have very deliberately used the less-expected pronoun, such as "he" for nurse and "she" for professor, to make the point that profession and gender are not linked. You may also have noticed that you sometimes stumbled over some of those sentences, taking longer to read them. That is what Foertsch and Gernsbacher's study was all about.)

Foertsch and Gernsbacher asked participants to read sentences like "A truck driver should never drive when sleepy, even if (he/she/they) may be struggling to make a delivery on time, because many accidents are caused by drivers who fall asleep at the wheel." On some trials the words in parentheses were replaced by the gender-stereotypic expected pronoun, sometimes by the gender-stereotypic unexpected pronoun, and sometimes by "they." For our purposes the dependent variable will be taken as the difference in reading time between sentences with unexpected pronouns and sentences with "they." There were three kinds of sentences in this study: those in which the expected pronoun was male, those in which it was female, and those in which it could equally be male or female. There are several dependent variables I could use from this study, but I have chosen the effect of seeing "she" when expecting "he," the effect of seeing "he" when expecting "she," and the effect of seeing "they" when the expectation is neutral. (The original study is more complete than this.) The dependent variable is the reading time per character (in milliseconds). The data in Table 20.6 have been created to have roughly the same medians as the authors report.

Table 20.6 Data on Reading Times as a Function of Pronoun
(reading time per character for each participant under the three conditions: Expect He/See She, Expect She/See He, and Neutral/See They)

Here we have repeated measures on each participant, because each participant was presented with each kind of sentence. Some people read anything more slowly than others, and that is reflected in the raw data. The data are far from normally distributed, which is why I am applying a distribution-free test.

For Friedman's test the data are ranked within each participant from low to high. If it is easier to read neutral sentences with "they" than sentences with an unexpected pronoun, then the lowest ranks for each participant should pile up in the Neutral category. The ranked data form a table with one row per participant and the sum of the ranks for each condition at the bottom.

If the null hypothesis were true, we would expect the rankings to be randomly distributed within each participant. Thus one participant might do best on sentences with an expected "he," another might do best with an expected "she," and a third might do best with an expected "they." If this were the case, the sums of the rankings for the conditions would be approximately equal. On the other hand, if neutral sentences with "they" are easiest, then most participants would have their lowest ranking under that condition, and the sums of the rankings for the three conditions would be decidedly unequal.

To apply Friedman's test, we rank the raw scores for each participant separately and then sum the rankings for each condition. We then evaluate the variability of the sums by computing

χ²_F = [12 / (Nk(k + 1))] Σ R_j² − 3N(k + 1)

where

R_j = the sum of the ranks for the jth condition
N = the number of participants
k = the number of conditions

and the summation is taken over all k conditions. This value of χ²_F can be evaluated against the standard χ² distribution on k − 1 df.

For the data in Table 20.6, χ²_F computed in this way exceeds the critical value of χ² on k − 1 = 2 df, which is 5.99, so we can reject H₀ and conclude that reading times are not independent of conditions. People can read a neutral sentence with "they" much faster than they can read sentences in which the gender of the pronoun conflicts with the expected gender. From additional data that Foertsch and Gernsbacher present, it is clear that "they" is easier to read than the wrong-gender pronoun, but harder than the expected-gender pronoun.
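A sketch of the same analysis in Python, again with placeholder reading times rather than the actual Table 20.6 values:

from scipy import stats

# Placeholder reading times (ms per character) for 5 participants -- NOT
# the actual Table 20.6 data.  Each list is one condition, measured on the
# same participants, which is what friedmanchisquare expects.
expect_he_see_she = [52, 61, 49, 58, 55]
expect_she_see_he = [50, 58, 47, 56, 53]
neutral_see_they = [41, 47, 40, 45, 42]

# scipy ranks the three scores within each participant and computes chi^2_F.
chi2_F, p = stats.friedmanchisquare(expect_he_see_she,
                                    expect_she_see_he,
                                    neutral_see_they)
print(chi2_F, p)    # compare to chi-square on k - 1 = 2 df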

20.5 Measures of Effect Size

Measures of effect size are difficult to find for distribution-free statistical tests.¹ An important reason for this is that many of our effect-size measures are based on the size of the standard deviation, and if the data are very badly (nonnormally)

¹ Conover (1980) discusses the use of confidence intervals for nonparametric procedures.


More information

EPS 625 INTERMEDIATE STATISTICS FRIEDMAN TEST

EPS 625 INTERMEDIATE STATISTICS FRIEDMAN TEST EPS 625 INTERMEDIATE STATISTICS The Friedman test is an extension of the Wilcoxon test. The Wilcoxon test can be applied to repeated-measures data if participants are assessed on two occasions or conditions

More information

13: Additional ANOVA Topics. Post hoc Comparisons

13: Additional ANOVA Topics. Post hoc Comparisons 13: Additional ANOVA Topics Post hoc Comparisons ANOVA Assumptions Assessing Group Variances When Distributional Assumptions are Severely Violated Kruskal-Wallis Test Post hoc Comparisons In the prior

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

STATISTICAL SIGNIFICANCE OF RANKING PARADOXES

STATISTICAL SIGNIFICANCE OF RANKING PARADOXES STATISTICAL SIGNIFICANCE OF RANKING PARADOXES Anna E. Bargagliotti and Raymond N. Greenwell Department of Mathematical Sciences and Department of Mathematics University of Memphis and Hofstra University

More information

Testing Research and Statistical Hypotheses

Testing Research and Statistical Hypotheses Testing Research and Statistical Hypotheses Introduction In the last lab we analyzed metric artifact attributes such as thickness or width/thickness ratio. Those were continuous variables, which as you

More information

1 Nonparametric Statistics

1 Nonparametric Statistics 1 Nonparametric Statistics When finding confidence intervals or conducting tests so far, we always described the population with a model, which includes a set of parameters. Then we could make decisions

More information

Introduction to Hypothesis Testing OPRE 6301

Introduction to Hypothesis Testing OPRE 6301 Introduction to Hypothesis Testing OPRE 6301 Motivation... The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of a certain belief, or hypothesis, about

More information

Standard Deviation Estimator

Standard Deviation Estimator CSS.com Chapter 905 Standard Deviation Estimator Introduction Even though it is not of primary interest, an estimate of the standard deviation (SD) is needed when calculating the power or sample size of

More information

The Wilcoxon Rank-Sum Test

The Wilcoxon Rank-Sum Test 1 The Wilcoxon Rank-Sum Test The Wilcoxon rank-sum test is a nonparametric alternative to the twosample t-test which is based solely on the order in which the observations from the two samples fall. We

More information

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics Ms. Foglia Date AP: LAB 8: THE CHI-SQUARE TEST Probability, Random Chance, and Genetics Why do we study random chance and probability at the beginning of a unit on genetics? Genetics is the study of inheritance,

More information

Chapter 7 Section 7.1: Inference for the Mean of a Population

Chapter 7 Section 7.1: Inference for the Mean of a Population Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used

More information

Chapter G08 Nonparametric Statistics

Chapter G08 Nonparametric Statistics G08 Nonparametric Statistics Chapter G08 Nonparametric Statistics Contents 1 Scope of the Chapter 2 2 Background to the Problems 2 2.1 Parametric and Nonparametric Hypothesis Testing......................

More information

NAG C Library Chapter Introduction. g08 Nonparametric Statistics

NAG C Library Chapter Introduction. g08 Nonparametric Statistics g08 Nonparametric Statistics Introduction g08 NAG C Library Chapter Introduction g08 Nonparametric Statistics Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric

More information

Analysis of Variance ANOVA

Analysis of Variance ANOVA Analysis of Variance ANOVA Overview We ve used the t -test to compare the means from two independent groups. Now we ve come to the final topic of the course: how to compare means from more than two populations.

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Nonparametric Statistics

Nonparametric Statistics Nonparametric Statistics J. Lozano University of Goettingen Department of Genetic Epidemiology Interdisciplinary PhD Program in Applied Statistics & Empirical Methods Graduate Seminar in Applied Statistics

More information

Tutorial 5: Hypothesis Testing

Tutorial 5: Hypothesis Testing Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................

More information

NCSS Statistical Software. One-Sample T-Test

NCSS Statistical Software. One-Sample T-Test Chapter 205 Introduction This procedure provides several reports for making inference about a population mean based on a single sample. These reports include confidence intervals of the mean or median,

More information

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Two-Sample T-Tests Assuming Equal Variance (Enter Means) Chapter 4 Two-Sample T-Tests Assuming Equal Variance (Enter Means) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when the variances of

More information

Outline. Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test

Outline. Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test The t-test Outline Definitions Descriptive vs. Inferential Statistics The t-test - One-sample t-test - Dependent (related) groups t-test - Independent (unrelated) groups t-test Comparing means Correlation

More information

UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST

UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST UNDERSTANDING The independent-samples t test evaluates the difference between the means of two independent or unrelated groups. That is, we evaluate whether the means for two independent groups are significantly

More information

Exact Nonparametric Tests for Comparing Means - A Personal Summary

Exact Nonparametric Tests for Comparing Means - A Personal Summary Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola

More information

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate 1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com

More information

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences Introduction to Statistics for Psychology and Quantitative Methods for Human Sciences Jonathan Marchini Course Information There is website devoted to the course at http://www.stats.ox.ac.uk/ marchini/phs.html

More information

Basic Concepts in Research and Data Analysis

Basic Concepts in Research and Data Analysis Basic Concepts in Research and Data Analysis Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...3 The Research Question... 3 The Hypothesis... 4 Defining the

More information

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics Period Date LAB : THE CHI-SQUARE TEST Probability, Random Chance, and Genetics Why do we study random chance and probability at the beginning of a unit on genetics? Genetics is the study of inheritance,

More information

CHAPTER 12 TESTING DIFFERENCES WITH ORDINAL DATA: MANN WHITNEY U

CHAPTER 12 TESTING DIFFERENCES WITH ORDINAL DATA: MANN WHITNEY U CHAPTER 12 TESTING DIFFERENCES WITH ORDINAL DATA: MANN WHITNEY U Previous chapters of this text have explained the procedures used to test hypotheses using interval data (t-tests and ANOVA s) and nominal

More information

1.5 Oneway Analysis of Variance

1.5 Oneway Analysis of Variance Statistics: Rosie Cornish. 200. 1.5 Oneway Analysis of Variance 1 Introduction Oneway analysis of variance (ANOVA) is used to compare several means. This method is often used in scientific or medical experiments

More information

Lab 11. Simulations. The Concept

Lab 11. Simulations. The Concept Lab 11 Simulations In this lab you ll learn how to create simulations to provide approximate answers to probability questions. We ll make use of a particular kind of structure, called a box model, that

More information

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

More information

Comparing the Means of Two Populations: Independent Samples

Comparing the Means of Two Populations: Independent Samples CHAPTER 14 Comparing the Means of Two Populations: Independent Samples 14.1 From One Mu to Two Do children in phonics-based reading programs become better readers than children in whole language programs?

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Nonparametric statistics and model selection

Nonparametric statistics and model selection Chapter 5 Nonparametric statistics and model selection In Chapter, we learned about the t-test and its variations. These were designed to compare sample means, and relied heavily on assumptions of normality.

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics Analysis of Data Claudia J. Stanny PSY 67 Research Design Organizing Data Files in SPSS All data for one subject entered on the same line Identification data Between-subjects manipulations: variable to

More information

Stat 5102 Notes: Nonparametric Tests and. confidence interval

Stat 5102 Notes: Nonparametric Tests and. confidence interval Stat 510 Notes: Nonparametric Tests and Confidence Intervals Charles J. Geyer April 13, 003 This handout gives a brief introduction to nonparametrics, which is what you do when you don t believe the assumptions

More information

Descriptive Statistics and Measurement Scales

Descriptive Statistics and Measurement Scales Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample

More information

Means, standard deviations and. and standard errors

Means, standard deviations and. and standard errors CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard

More information

The Kruskal-Wallis test:

The Kruskal-Wallis test: Graham Hole Research Skills Kruskal-Wallis handout, version 1.0, page 1 The Kruskal-Wallis test: This test is appropriate for use under the following circumstances: (a) you have three or more conditions

More information

Week 3&4: Z tables and the Sampling Distribution of X

Week 3&4: Z tables and the Sampling Distribution of X Week 3&4: Z tables and the Sampling Distribution of X 2 / 36 The Standard Normal Distribution, or Z Distribution, is the distribution of a random variable, Z N(0, 1 2 ). The distribution of any other normal

More information

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference) Chapter 45 Two-Sample T-Tests Allowing Unequal Variance (Enter Difference) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when no assumption

More information

individualdifferences

individualdifferences 1 Simple ANalysis Of Variance (ANOVA) Oftentimes we have more than two groups that we want to compare. The purpose of ANOVA is to allow us to compare group means from several independent samples. In general,

More information

UNDERSTANDING THE DEPENDENT-SAMPLES t TEST

UNDERSTANDING THE DEPENDENT-SAMPLES t TEST UNDERSTANDING THE DEPENDENT-SAMPLES t TEST A dependent-samples t test (a.k.a. matched or paired-samples, matched-pairs, samples, or subjects, simple repeated-measures or within-groups, or correlated groups)

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

UNIVERSITY OF NAIROBI

UNIVERSITY OF NAIROBI UNIVERSITY OF NAIROBI MASTERS IN PROJECT PLANNING AND MANAGEMENT NAME: SARU CAROLYNN ELIZABETH REGISTRATION NO: L50/61646/2013 COURSE CODE: LDP 603 COURSE TITLE: RESEARCH METHODS LECTURER: GAKUU CHRISTOPHER

More information

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS About Omega Statistics Private practice consultancy based in Southern California, Medical and Clinical

More information

Chapter 9. Two-Sample Tests. Effect Sizes and Power Paired t Test Calculation

Chapter 9. Two-Sample Tests. Effect Sizes and Power Paired t Test Calculation Chapter 9 Two-Sample Tests Paired t Test (Correlated Groups t Test) Effect Sizes and Power Paired t Test Calculation Summary Independent t Test Chapter 9 Homework Power and Two-Sample Tests: Paired Versus

More information

Non-Inferiority Tests for One Mean

Non-Inferiority Tests for One Mean Chapter 45 Non-Inferiority ests for One Mean Introduction his module computes power and sample size for non-inferiority tests in one-sample designs in which the outcome is distributed as a normal random

More information

Statistics. One-two sided test, Parametric and non-parametric test statistics: one group, two groups, and more than two groups samples

Statistics. One-two sided test, Parametric and non-parametric test statistics: one group, two groups, and more than two groups samples Statistics One-two sided test, Parametric and non-parametric test statistics: one group, two groups, and more than two groups samples February 3, 00 Jobayer Hossain, Ph.D. & Tim Bunnell, Ph.D. Nemours

More information

Introduction to Quantitative Methods

Introduction to Quantitative Methods Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Research Methods & Experimental Design

Research Methods & Experimental Design Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and

More information

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test Introduction to Analysis of Variance (ANOVA) The Structural Model, The Summary Table, and the One- Way ANOVA Limitations of the t-test Although the t-test is commonly used, it has limitations Can only

More information

Chi Square Tests. Chapter 10. 10.1 Introduction

Chi Square Tests. Chapter 10. 10.1 Introduction Contents 10 Chi Square Tests 703 10.1 Introduction............................ 703 10.2 The Chi Square Distribution.................. 704 10.3 Goodness of Fit Test....................... 709 10.4 Chi Square

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

Skewed Data and Non-parametric Methods

Skewed Data and Non-parametric Methods 0 2 4 6 8 10 12 14 Skewed Data and Non-parametric Methods Comparing two groups: t-test assumes data are: 1. Normally distributed, and 2. both samples have the same SD (i.e. one sample is simply shifted

More information

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 Introduction to Discrete Probability Probability theory has its origins in gambling analyzing card games, dice,

More information

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217 Part 3 Comparing Groups Chapter 7 Comparing Paired Groups 189 Chapter 8 Comparing Two Independent Groups 217 Chapter 9 Comparing More Than Two Groups 257 188 Elementary Statistics Using SAS Chapter 7 Comparing

More information