Testing for differences I: exercises with SPSS

Introduction

The exercises presented here are all about the t-test and its non-parametric equivalents in their various forms. In SPSS, the t-tests can be found in the Compare Means sub-menu of the Analyze pull-down menu, and the non-parametric tests in the Nonparametric Tests sub-menu of the same menu. The data sets in the exercises can be downloaded from the course web page.

Before testing

Since the t-test is built on certain model assumptions, the data should be inspected before a t-test is performed, to ensure that the model assumptions are valid for the data set at hand. Therefore, make it a habit to use the Explore function of SPSS to get an overview of the properties of your data. You will find the Explore function in the Descriptive Statistics sub-menu of the Analyze pull-down menu. In the Explore dialogue box, click the Plots button and tick the Histogram and Normality plots with tests options before you proceed.

The Normality plots with tests option will perform two tests of normality: a Kolmogorov-Smirnov test with Lilliefors significance correction and a Shapiro-Wilk test. The Lilliefors correction is used to adjust for the fact that the true mean and standard deviation of the hypothesised normal distribution are unknown and have to be estimated from the sample before the Kolmogorov-Smirnov test can be applied. The Kolmogorov-Smirnov test is also available from the Nonparametric Tests sub-menu of the Analyze pull-down menu, but this version of the test does not use the Lilliefors correction (and should therefore not be used unless the parameters of the hypothesised normal distribution are specified beforehand).

Mercury in pike

This is an example of a single-sample situation where the mean under the null hypothesis is specified beforehand. The population of pike in a lake was investigated for its content of mercury (Hg). A sample of 10 pike of a certain size was caught and the concentration of mercury was determined (unit: mg/kg). The data are stored in the file PikeData.sav.

1. Explore the data both graphically and through summary statistics, and check whether it is reasonable to assume that the data are normally distributed.
2. Test the null hypothesis H0: μ = 0.9 against the alternative H1: μ > 0.9.
   a. What kind of alternative hypothesis is this, and what implications does it have for the test?
   b. Compare the sample mean with the mean specified by the null hypothesis. What is the p-value for the difference? (Remember the formulation of the alternative hypothesis.)
   c. Can the null hypothesis be rejected at the 0.05 level of significance?
   d. Why is df = N - 1?
3. Test the null hypothesis H0: μ = 1.1 against the alternative H1: μ < 1.1.
   a. Compare the sample mean with the mean specified by the null hypothesis. What is the p-value for the difference?
   b. Is the difference significant at the 0.05 level?
4. Compare the results from the two tests above. Have we been able to prove (or disprove) anything with any degree of certainty? Do you have any suggestions on how to improve the situation?
5. What is the difference in point of departure in the two sets of hypotheses presented above?
   a. Which set of hypotheses would you choose if you were a fishmonger selling pike from this lake?
   b. Which set of hypotheses would you choose if you were a cautious customer?

Petrol campaign

This exercise illustrates the difference between the related and the unrelated t-test. A campaign to motivate citizens to reduce the consumption of petrol was planned. Before the campaign was launched, an experiment was carried out to evaluate the effectiveness of such a campaign. For the experiment, the campaign was conducted in a small but representative geographical area. Twelve families were randomly selected from the area, and the amount of petrol (unit: litre) they used was monitored for 1 month prior to the advertising campaign and for 1 month following the campaign. Unfortunately, the variable identifying the different families was lost during a data conversion from one format to another, so you will have to treat the data as two independent samples.

1. Load the data set PetrolCampaignData1.sav.
2. Explore the data both graphically and through summary statistics, and check whether it is reasonable to assume that the data are normally distributed.
3. Formulate a suitable pair of null and alternative hypotheses.
4. Test your hypothesis at the 5% level.
5. Calculate the effect size for the test.
6. What is your conclusion regarding the efficacy of the experimental campaign?
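Step 5 above asks for an effect size. As a minimal sketch, and assuming the formula in the book is the common conversion r = sqrt(t^2 / (t^2 + df)) (an assumption, so check it against your textbook), the calculation can be done from the t statistic and the degrees of freedom in the SPSS output. The function name and the example values below are illustrative and are not taken from the data sets.

```python
import math

def effect_size_r(t, df):
    """Convert a t statistic and its degrees of freedom to the effect size r.

    Assumes the conversion r = sqrt(t^2 / (t^2 + df)); the course book may
    define the effect size differently.
    """
    return math.sqrt(t * t / (t * t + df))

# Hypothetical values for illustration only (not SPSS output).
r = effect_size_r(t=2.0, df=16)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

The square of r is the share of variation explained, which is how the percentages quoted next to r in the answers can be read.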
By a stroke of luck, the original data file was found again, so the data analysis can now be carried out according to the original design of the experiment. However, in SPSS, the paired samples t-test requires data in a different format than the format used in the independent samples t-test.

1. Follow the Paired-samples T Test section in the Tutorial to see how it requires the data to be presented.
2. Load the data set PetrolCampaignData2.sav.
3. Explore the data graphically (now that you can identify the pairs in the data set) and check if they can be regarded as normally distributed.
4. Test your hypothesis at the 5% level.
5. Compare the results from the two test situations: Do you reach the same conclusion? If not, how can you explain the difference? Which of the tests would you regard as most appropriate?
6. Go to the Transform pull-down menu and select Compute Variable to compute a new variable Diff which has the value After - Before. Now perform a one-sample t-test on the data set consisting of the column of pairwise differences. What value should you put in the Test Value box? (Remember the null hypothesis.) What do you get? (Compare the result with what you obtained from the paired samples t-test.)

Diet study

A health psychologist wanted to evaluate the effects of a particular diet on weight. Thirty-five obese male volunteers were randomly selected and put on the diet for 3 months. Baseline and end-of-program weights (unit: pound) were recorded for each subject.

1. Open the Diet Study data set.
2. Formulate a suitable pair of null and alternative hypotheses.
3. Which test corresponds to the design of the study?
4. Test your hypothesis at the 5% level.
5. What is your conclusion regarding the effect of the diet? Do you trust the result of the test?

Tabletop hockey data

Now that you know how to do a t-test, you can actually test if there is any significant difference in shot distance depending on the shot type.

1. Open the tabletop hockey data set.
2. Formulate a suitable pair of null and alternative hypotheses.
3. Which test corresponds to the design of the study?
4. Select the cases from group B and test your hypothesis at the 5% level of significance.
5. What is your conclusion regarding the effect of the shot type? Do you trust the result of the test?
6. Select the cases from group F and test your hypothesis at the 5% level.
7. What is your conclusion regarding the effect of the shot type for this group? Do you trust the result of the test?

Labour force participation rate of women

This dataset contains the labour force participation rate (LFPR) of women in 19 cities in the United States in each of the years 1968 and 1972. The data help to measure the growing presence of women in the labour force over this period. It may seem reasonable to compare LFPR rates in the two years with a pooled t-test since the United States did not change much from 1968 to 1972.

1. Load the data set LaborForceData1.sav.
2. Explore the data both graphically and through summary statistics, and check whether it is reasonable to assume that the data are normally distributed.
3. Formulate a suitable pair of null and alternative hypotheses.
4. Test your hypothesis at the 5% level.
5. What is your conclusion regarding the change in LFPR?

However, the data are naturally paired because the measurements were made in the same cities for each of the two years. It is better to compare each city in 1972 to its own value in 1968.

1. Load the data set LaborForceData2.sav.
2. Explore the data both graphically and through summary statistics, and check whether it is reasonable to assume that the data are normally distributed.
3. Test your hypothesis at the 5% level.
4. What is your conclusion regarding the change in LFPR?
5. Compare the results from the two test situations: Do you reach the same conclusion? If not, how can you explain the difference?

Since the Kolmogorov-Smirnov test of normality gave a highly significant result for this data set (LaborForceData2.sav), it would be wise to redo the analysis of the data with a more appropriate test. Compare your new results with the results from the t-test: do you reach the same conclusion?
Left-handers and right-handers

A psychologist who was interested in determining whether left-handed and right-handed people differ in spatial ability constructed a test that measures spatial ability. The test was administered to two randomly selected groups, 10 left-handers and 10 right-handers, from the students at the university where she worked. The scores are stored in the data file HandednessData.sav. A higher score indicates better spatial ability. (Note that one of the subjects did not show up for the testing.) Formulate a null and an alternative hypothesis and test the null hypothesis with a suitable test.

Promotion of attitudes

A major food company conducted an experiment to assess whether a film designed to tell the truth about, and also promote more favourable attitudes toward, genetically modified (GM) foods really would result in more favourable attitudes. Twelve persons participated in a replicated measures design. In the before condition, each subject filled out a questionnaire designed to assess attitudes toward GM foods. In the after condition, the subjects saw the film, after which they filled out the questionnaire. The scores are stored in the file AttitudesData.sav. High scores indicate more favourable attitudes toward GM foods. Formulate a suitable pair of hypotheses and test the null hypothesis at the 5% level of significance. What is your conclusion?

Answers to the questions

Mercury in pike

This example illustrates how different formulations of the null and alternative hypotheses can, in certain cases, reflect fundamentally different points of view, and how this will influence the conclusions drawn from the test.

1. The p-value for the Kolmogorov-Smirnov test of normality is not less than 0.200 and the p-value for the Shapiro-Wilk test is 0.954, which are both greater than 0.05 (our standard level of significance), so we do not reject the null hypothesis that the data are normally distributed (which means that we keep this hypothesis and continue to treat the data as normally distributed).
2. Note that this is a one-sample situation.
   a. This is a one-sided alternative, which implies that a one-tailed test should be used. Since the alternative is that the population mean is greater than 0.9, it is the area cut off from the upper tail which will correspond to the p-value.
   b. The sample mean is 0.970, while the mean value specified by the null hypothesis is 0.9, so the difference is 0.07. The p-value for this difference, using the one-tailed test, is 0.2595 (which is half the p-value for the two-tailed test, which SPSS reports to be 0.519).
   c. No, the null hypothesis cannot be rejected at significance level α = 0.05, since p = 0.2595 > 0.05.
   d. df = N - 1 = 10 - 1 = 9, because one degree of freedom is consumed by estimating the mean value prior to estimating the standard deviation (which is used in the t-test statistic).
3. It is still a one-sided alternative hypothesis, but because the inequality is now turned the other way round, compared with the previous case, we now have to look at the lower tail in the one-tailed test.
   a. The sample mean is still 0.970 (it's the same sample), while the mean value specified by the null hypothesis now is 1.1, so the difference is -0.13. The p-value for this difference, using the one-tailed test, is 0.1225 (which is half the p-value for the two-tailed test, which SPSS reports to be 0.245).
   b. No, the difference is not significant at significance level α = 0.05, since p = 0.1225 > 0.05.
4. We cannot reject the null hypothesis that μ is equal to 0.9 (or less, actually, since we had a one-sided alternative hypothesis), and neither can we reject the null hypothesis that μ is equal to 1.1 (or greater, since we in this case had a one-sided alternative hypothesis turning the other way). Thus, there is not enough evidence to conclude that the average mercury concentration is greater than 0.9, and neither is there evidence with sufficient weight to say that it is less than 1.1. A rather inconclusive result, it seems. This could be due to lack of power for the test. The discriminating power of the test can be improved by increasing the sample size.
5. In the first case, we keep the hypothesis that the average mercury concentration is less than or equal to 0.9 unless the data lead us to reject this hypothesis. In the second case, we keep the hypothesis that the average mercury concentration is greater than or equal to 1.1, unless the data lead us to reject this hypothesis. The first position is "optimistic": we don't believe that the mercury concentration is very high until it is proved by sufficient evidence. The second position is more "pessimistic": we stick to the belief that the mercury concentration is rather high until the contrary is proved.
   a. If you sell fish from this lake for a living, then you would probably hold on to the first position and keep selling your fish until someone can prove that it is poisonous.
   b. If you are a wary customer, you would rather play it safe and refrain from buying those fish until someone has proved that they are not poisonous, which means that you would take the second position.

Petrol campaign

First part, the independent samples situation:

2. The two samples should be investigated separately. In both cases (before and after, respectively) the p-value for the Kolmogorov-Smirnov test of normality is not less than 0.200, which is greater than 0.05 (our standard level of significance), so we do not reject the null hypothesis that the data are normally distributed (which means that we keep this hypothesis and continue to treat the data as normally distributed).
3. The null hypothesis would be the usual "no effect", that is, the two group means are equal. Since the aim of the campaign is to motivate citizens to reduce their petrol consumption, a reasonable position would be to say that the full campaign will be launched only if the experiment shows that such a campaign, with reasonable certainty, will have the desired effect. In this case, the alternative hypothesis would be "there is a reduction in petrol consumption", which means that the mean of the after group is less than the mean of the before group.
4. The p-value for Levene's test for equality of variances is 0.618, which is greater than 0.05 (our standard significance level), so we do not reject the null hypothesis that the variances of the two groups are equal. Thus, we proceed to study the results from the t-test in the output table. The p-value for the two-tailed test is 0.462, but since we have chosen a one-sided alternative hypothesis, we should use a one-tailed test. Thus, we should divide the p-value computed by SPSS by two, which gives p = 0.231 > 0.05 (our standard α), and we cannot reject the null hypothesis at the 5% level of significance.
5. The effect size can be calculated using the formula in the book, and the answer is 0.157.
6. The effect of the experimental campaign is not statistically significant at the 5% level. Furthermore, the size of the effect is small (by the rules of thumb given in the book).

Second part, the matched pairs situation:

3. Since the paired samples t-test is based on the pairwise differences, it is the set of differences which should be tested for normality. The Kolmogorov-Smirnov test gives p = 0.172, which is not significant at the 5% level, so we continue to regard the data as normally distributed.
4. The p-value for the two-tailed test is 0.014, and since we are still working under the same one-sided alternative hypothesis as above, we should divide this by two to obtain the p-value for the corresponding one-tailed test. Since 0.007 < 0.05, the null hypothesis of no effect is rejected in favour of the alternative (i.e. that the experimental campaign has led to a statistically significant reduction in fuel consumption).
5. The experiment was set up according to a paired samples pre-test/post-test design to eliminate the influence of variation between families, which was considered to be a nuisance or noise factor in this case. If the variation between families is considerable, the paired t-test is the appropriate test to use; otherwise the variation between families may mask the effect of the treatment (campaign). That is actually what happened in the independent samples case, which did not show a significant effect due to the amount of noise introduced by the variation between families. So, design and type of statistical test do matter!
6. It is the same test performed in two different ways, so the results should be the same (with the exception of the sign of the difference and the test statistic, which will depend on which condition was used as baseline).

Diet study

2. The null hypothesis is "no effect", the usual stance of the sceptical scientist. There is no obvious direction of the alternative hypothesis in this case, so it will be two-sided.
3. It is a paired samples design, so the test should be a paired samples t-test.
4. The average difference between end-of-program weight and baseline weight is -12.743 with p = 0.000 (i.e. zero to 3 decimal places, which we would usually report as p < 0.001), which is clearly significant at the 5% level.
5. Although the diet has the effect of an average weight reduction of 12.743 pounds, which is statistically significant at the 5% level, this reduction is only about 5.4% of the average baseline weight. Whether this can be considered to be practically significant has to be evaluated from other criteria.
   Since the paired samples t-test is based on the pairwise differences, it is the set of differences which should be tested for normality. The baseline and end-of-program data taken separately seem to be quite skewed and have excess kurtosis, and they do not pass the Kolmogorov-Smirnov test at the 0.05 level. However, the pairwise differences are closer to normal and also pass the K-S test at the 0.05 level of significance. It will therefore be appropriate to use the paired samples t-test for the analysis.

Tabletop hockey data

2. Null hypothesis: no effect of shot type on distance travelled by the puck, i.e. the population means for the two shot types are equal. The alternative hypothesis depends on the situation. If you are just investigating whether there is any difference, the alternative hypothesis would be: the means for the two shot types are different.
3. Due to the design of the experiment, the independent groups t-test will be the test to use.
4. The data do not pass Levene's test for equality (homogeneity) of variances at the 0.05 level of significance, so the results from the second row of the output table should be used (the row labelled "Equal variances not assumed").
5. Both samples pass the K-S test at the 5% level, so the results from the t-test (adjusted for unequal variances) should be fairly reliable. Since p = 0.085, the null hypothesis will not be rejected at the 5% level. If we had adopted a one-sided alternative hypothesis, then a one-tailed test should be used, and we would obtain p = 0.0425, which is significant at the 0.05 level. The average slap shot distance gives a 34% improvement compared with the average drag shot distance. Using the formula in the book, the effect size is r = 0.29 (about 9% of the variation explained), which can be regarded as a medium effect according to the "t-shirt" rule of thumb.
6. The data pass Levene's test for equality of variances (homoscedasticity), so the results from the top row of the table can be used. SPSS delivers a p-value of 0.770, which means that the difference in shot distances is not significant, irrespective of whether a one- or two-tailed test is used.
7. The average slap shot distance gives a 4% improvement compared with the average drag shot distance. Using the formula in the book, the effect size is r ≈ 0.05 (about 0.2% of the variation explained). Thus, the effect is neither statistically significant nor practically important. The p-value for the K-S test is not less than 0.2 for any of the samples, so the t-test is appropriate for this data set.

Labour force participation rate of women

The independent t-test:

2. Since the Kolmogorov-Smirnov test gives p ≥ 0.2 for both samples, the data can be assumed to be normally distributed, and it will be OK to use the t-test.
3. Null hypothesis: no change, i.e. the population means for the two groups (the population in 1968 and 1972, respectively) are equal. The two-sided alternative hypothesis: the two population means differ.
4. Levene's test for equality of variances is not significant at the 0.05 level, so we can assume equal variances for the two groups and trust the results from the top row of the output table. With p = 0.143, the difference is not significant at the specified level of significance (this applies to both the two- and the one-sided test).
5. We cannot conclude that there is any change in LFPR between the two years.

The dependent t-test:

2. As can be seen from the scatter plot in Figure 1, the data from the two years are quite strongly correlated, with a large variation between cities. It will therefore be appropriate to use the paired samples t-test, doing the comparison between the two years within each city. The Kolmogorov-Smirnov test of normality gives p = 0.004, which is significant at the 0.05 level. Thus, the distribution of the differences deviates significantly from the normal distribution, and the results from the t-test may not be trustworthy.
3. If we, in spite of the doubt cast by the K-S test, perform a paired samples t-test, we get p = 0.004 for the two-tailed test, which is a highly significant result. It seems that the change in LFPR between the years 1968 and 1972, which was masked by the large variation between cities in the previous test situation, has now become visible. That there is a positive change in the majority of the sampled cities is apparent from the graph of the differences shown in Figure 2, but since we can't fully trust the results from the t-test, we can't really say how significant (in statistical terms) this change is. To cope with data that are not normally distributed, we have to use another test which does not rely on this assumption. More about that later in the course.
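The masking effect described above can be sketched with a few lines of code. This is only an illustration under stated assumptions: the rates below are made up (they are not the LaborForceData values), the function names are hypothetical, and only the t statistics are computed, not the p-values that SPSS reports.

```python
import math
import statistics

def pooled_t(x, y):
    """Independent samples t statistic, assuming equal variances."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se

def paired_t(x, y):
    """Paired samples t statistic: a one-sample t-test of the differences against 0."""
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Hypothetical rates for six cities: large spread between cities,
# small but consistent increase within each city.
rate68 = [0.40, 0.50, 0.60, 0.45, 0.55, 0.65]
rate72 = [0.43, 0.52, 0.64, 0.47, 0.58, 0.68]

t_independent = pooled_t(rate72, rate68)
t_paired = paired_t(rate72, rate68)
print(f"independent t = {t_independent:.2f}, paired t = {t_paired:.2f}")
```

With these made-up numbers the independent t statistic stays small because the between-city spread dominates its standard error, while the paired statistic is large because it only uses the within-city differences. The paired function also shows why a one-sample t-test on the column of differences with test value 0 gives the same result as the paired samples t-test.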
Figure 1. Labour force participation rate of women in 19 cities in the US in 1968 and 1972.

Figure 2. Change in labour force participation rate of women in 19 cities in the US in 1968 and 1972 (with horizontal lines for the zero and mean levels).

As mentioned above, the data are naturally paired because the measurements were made in the same cities for each of the two years. Therefore, comparisons between rates in 1972 and 1968 should be made within cities, i.e. treating the data as matched pairs/repeated measurements. The non-parametric counterpart to the paired t-test is the Wilcoxon signed rank test, which is found in the Nonparametric Tests sub-menu of the Analyze pull-down menu. Since we have two related (correlated, dependent) samples, 2 Related Samples is the item to choose from the sub-menu. Logically enough, the dialogue box for this analysis (see Figure 3) is very similar in appearance to the dialogue box for the paired samples t-test, and you can see that Wilcoxon is already selected as the default test type. Since we want to compare Rate72 with Rate68, these two variables should be entered in the Test Pairs box. If you want some descriptive statistics in addition to the test, you can select this as an option (click the Options button).

Figure 3. Dialogue box for the Two-Related-Samples Tests.

Running the test, you get two result tables, one with the ranks (Table 1) and one with the test statistics (Table 2).

Ranks
                                    N       Mean Rank   Sum of Ranks
Rate68 - Rate72   Negative Ranks    13(a)   8.04        104.50
                  Positive Ranks    2(b)    7.75        15.50
                  Ties              4(c)
                  Total             19
a. Rate68 < Rate72
b. Rate68 > Rate72
c. Rate68 = Rate72

Table 1. Result table with ranks, mean ranks and sum of ranks from the Two-Related-Samples Test.
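As a cross-check on the rank sums in Table 1, the large-sample Z statistic of the Wilcoxon signed rank test can be sketched by hand. This is a minimal sketch, not the SPSS implementation: it uses the plain normal approximation without the correction for tied ranks that SPSS applies, so the result may differ slightly in the second decimal from the SPSS output. The function name is hypothetical; the number of non-tied pairs is recovered from the total rank sum n(n + 1)/2.

```python
import math

def wilcoxon_z_from_rank_sums(sum_negative, sum_positive):
    """Normal approximation for the Wilcoxon signed rank test (no tie correction).

    Returns (n, z, p) where n is the number of non-zero differences,
    z is based on the positive rank sum, and p is the two-tailed p-value.
    """
    total = sum_negative + sum_positive
    # Ranks 1..n sum to n(n + 1)/2; solve for n.
    n = (math.sqrt(1 + 8 * total) - 1) / 2
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (sum_positive - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return n, z, p

# Rank sums from Table 1.
n, z, p = wilcoxon_z_from_rank_sums(sum_negative=104.5, sum_positive=15.5)
print(f"n = {n:.0f}, Z = {z:.3f}, p (2-tailed) = {p:.3f}")
```

The negative sign of Z reflects that the positive rank sum is much smaller than expected under the null hypothesis, i.e. most differences Rate68 - Rate72 are negative.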
Test Statistics(b)
                           Rate68 - Rate72
Z                          -2.539(a)
Asymp. Sig. (2-tailed)     .011
a. Based on positive ranks.
b. Wilcoxon Signed Ranks Test

Table 2. Result table with test statistics from the Two-Related-Samples Test.

The mean of negative ranks (8.04) is larger than the mean of positive ranks (7.75), and the difference is statistically significant at the 5% level (p = 0.011). From Table 1 we see that it is the difference Rate68 - Rate72 which has been analysed, which means that negative ranks correspond to an increase in participation rate from 1968 to 1972. Thus, we conclude that there is a significant increase in participation rate from 1968 to 1972. Incidentally, this is the same conclusion that was reached through the paired t-test, but since the Wilcoxon test does not depend on the assumption that the pairwise differences are normally distributed, the result from this test is more reliable in the present situation.

Left-handers and right-handers

The null hypothesis will be that there is no difference between the two groups, and since we have no further background information, we will take the non-directional alternative hypothesis that there is a difference one way or the other. Running the Explore function we see that p ≥ 0.200 for both groups in the Kolmogorov-Smirnov test, so we retain the hypothesis that the data are normally distributed and proceed to the Independent Samples T Test. Looking at the result table from the independent samples test, we see that Levene's test for equality of variances gives p = 0.755, which means that the assumption that the two groups have equal variances is met, and we can trust the results from the t-test. The t-test yields p = 0.072 (two-tailed), so at the 5% significance level we do not reject our null hypothesis that the two population means are equal, i.e. we find no evidence of a difference in spatial ability between left-handers and right-handers.
However, if we take a look at the graphical displays of the data, we see that there is one potential outlier in the left-handed group, namely case 3 with a score of 56, which is substantially below the rest of the scores in the same group. We should investigate the influence of this potential outlier before we reach our final conclusion. To exclude case 3 from the analysis, we can use the If condition is satisfied option in the Select Cases dialogue box called from the Data pull-down menu. In the Function Group box select Miscellaneous and then select $Casenum in the Functions and Special Variables box (see Figure 4). To exclude case number 3, we write $CASENUM ~= 3 in the formula box. Running the analysis again with the independent samples t-test, we get p = 0.016, which is clearly significant at the 5% level of significance. Thus, excluding this single potential outlier, we reach quite another conclusion than above. This illustrates the fact that the mean value is sensitive to outliers, and since the t-test is based on a comparison of mean values, the t-test will be sensitive to outliers as well.

Figure 4. Using the Select Cases dialogue box to filter out a single case.

For the sake of comparison, we can perform the Mann-Whitney U test, which is the non-parametric equivalent of the independent t-test. Since the Mann-Whitney test is based on a comparison of ranks rather than means, it will be more robust with respect to influence from potential outliers. This test is found in the Nonparametric Tests sub-menu of the Analyze pull-down menu. Since we have two independent samples, 2 Independent Samples is the item to choose from the sub-menu. Logically enough, the dialogue box for this analysis (see Figure 5) is very similar in appearance to the dialogue box for the independent t-test, and you can see that Mann-Whitney U is already selected as the default test type. The variable Score is our dependent variable, which should be entered in the Test Variable List box. We also need a grouping variable to tell SPSS how to divide the observations into the two groups that we want to compare. SPSS now requires this grouping variable to be numeric, and since our grouping variable Hand is a string variable, we have to compute a new grouping variable which is numeric. This is easily done with the Automatic Recode function in the Transform pull-down menu. When you have created such a variable, you enter it in the Grouping Variable box and then click the Define Groups button. In the Define Groups dialogue box you tell SPSS in which order you want the two groups to be compared (see Figure 6).
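The robustness argument above can be sketched with a small example. The scores below are made up for illustration (they are not the HandednessData.sav values) and the helper function is hypothetical; the point is only that making a low score more extreme changes the group mean but leaves the rank sum untouched.

```python
import statistics

def rank_sum(group, other):
    """Sum of the ranks of `group` within the pooled sample (assumes no ties)."""
    pooled = sorted(group + other)
    return sum(pooled.index(value) + 1 for value in group)

# Hypothetical scores for two small groups.
right = [78, 80, 82, 84, 85]
left_mild = [77, 88, 90, 92, 94]     # one score slightly below the other group
left_extreme = [7, 88, 90, 92, 94]   # the same score made far more extreme

for label, left in (("mild", left_mild), ("extreme", left_extreme)):
    print(label,
          "mean =", statistics.mean(left),
          "rank sum =", rank_sum(left, right))
```

In both versions the low score has rank 1 in the pooled sample, so a rank-based test sees the same data, whereas the mean of the group drops sharply when the score becomes extreme.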
Figure 5. Dialogue box for the 2 Independent Samples Tests.

Figure 6. Dialogue box for defining the groups in the 2 Independent Samples Tests.

Running the test, we get two result tables, one with the ranks (Table 3) and one with the test statistics (Table 4).

Ranks
         Hand2    N     Mean Rank   Sum of Ranks
Score    left     9     12.72       114.50
         right    10    7.55        75.50
         Total    19

Table 3. Result table with mean ranks and sum of ranks from the 2 Independent Samples Test.
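The rank sums in Table 3 are enough to reconstruct the Mann-Whitney U statistic by hand. The following is a minimal sketch under stated assumptions, not the SPSS implementation: the function name is hypothetical, and the Z value uses the plain normal approximation without the correction for tied ranks, so it may differ slightly in the last decimal from the SPSS output.

```python
import math

def mann_whitney_from_rank_sums(r1, n1, r2, n2):
    """Mann-Whitney U, approximate Z and two-tailed p from the group rank sums.

    Uses U_i = R_i - n_i(n_i + 1)/2 and the normal approximation
    without tie correction.
    """
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return u, z, p

# Rank sums and group sizes from Table 3 (left: 9 cases, right: 10 cases).
u, z, p = mann_whitney_from_rank_sums(r1=114.5, n1=9, r2=75.5, n2=10)
print(f"U = {u}, Z = {z:.3f}, p (2-tailed) = {p:.3f}")
```

The smaller of the two rank sums is what SPSS labels Wilcoxon W, and U is obtained from it by subtracting the smallest possible rank sum for a group of that size.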
Test Statistics(b)
                                    Score
Mann-Whitney U                      20.500
Wilcoxon W                          75.500
Z                                   -2.001
Asymp. Sig. (2-tailed)              .045
Exact Sig. [2*(1-tailed Sig.)]      .043(a)
a. Not corrected for ties.
b. Grouping Variable: Hand2

Table 4. Result table with test statistics from the 2 Independent Samples Test.

The mean ranks for the two groups are quite different, and this difference is statistically significant at the 5% level, with p = 0.043 for the exact test (two-tailed). This is the same conclusion as was reached with the t-test with the potential outlier excluded, and illustrates the robustness of the Mann-Whitney test with respect to outlier observations.

Promotion of attitudes

Since this is a replicated measures design with two conditions, before and after treatment, we can use either the paired samples t-test or its non-parametric equivalent, the Wilcoxon signed rank test, to test the effect of the treatment (the film). The null hypothesis is that the treatment does not have any effect, and if we assume that this experiment is conducted in order to decide whether to launch the film or not, we can adopt the alternative hypothesis that exposure to the film will result in more favourable attitudes. This means that we will only reject the null hypothesis if we see a significant effect towards positive results (an increase in mean score).

We would prefer to use a parametric test, since this usually will have more power than its non-parametric equivalent. To test if the assumptions for the paired t-test are met, we perform an exploratory analysis on the difference scores. Although the histogram looks rather skewed, the Kolmogorov-Smirnov test does not indicate a significant deviation from normality (p ≥ 0.200). If we thus decide to continue with the paired samples t-test, we see that the mean difference is -4.583 (before - after), which is not significant at the 5% level (p = 0.053).
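SPSS reports two-tailed p-values, while several of the answers work with a one-sided alternative. The halving rule used in those answers can be sketched as follows. The function and its argument names are hypothetical, and the sketch assumes a symmetric test statistic such as t; the example values are the two-tailed p-values and differences quoted in the Mercury in pike answers.

```python
def one_tailed_p(p_two_tailed, observed_diff, alternative):
    """Convert a two-tailed p-value to a one-tailed one (symmetric statistic assumed).

    alternative: "greater" if H1 says the difference is positive,
                 "less" if H1 says the difference is negative.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    if alternative == "greater":
        in_predicted_direction = observed_diff > 0
    else:
        in_predicted_direction = observed_diff < 0
    half = p_two_tailed / 2
    # If the data point away from H1, the one-tailed p-value is the large tail.
    return half if in_predicted_direction else 1 - half

# Mercury in pike, H1: mean > 0.9 (sample mean 0.970, two-tailed p = 0.519).
p_upper = one_tailed_p(0.519, 0.970 - 0.9, "greater")
# Mercury in pike, H1: mean < 1.1 (sample mean 0.970, two-tailed p = 0.245).
p_lower = one_tailed_p(0.245, 0.970 - 1.1, "less")
print(p_upper, p_lower)
```

The same function can be applied to the two-tailed p-value reported above for the attitude scores, using the sign of the difference before - after and the direction of the adopted alternative hypothesis.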