Statistical Analysis of Independent Groups in SPSS

Peter Samuels, 30th October 2015

Based on materials provided by Coventry University and Loughborough University under a National HE STEM Programme Practice Transfer Adopters grant

Overview
Lab session teaching you how to analyse differences in the means/medians of two or more independent samples of a single scale variable
Common student activity
Self-contained: only a finite number of possibilities

Workshop outline
Two groups:
  Descriptives
  Assumption checking (for parametric tests)
  Independent samples t-test
  Mann-Whitney U test
Several groups:
  Descriptives
  Assumption checking (for parametric tests)
  One-way ANOVA
  Kruskal-Wallis test
  Post hoc testing

The data analysis process for 2 independent groups
Descriptive statistics, then assumption checking
  Pass: parametric testing (t-test)
  Fail: nonparametric testing (Mann-Whitney U test)

Example 1: 2 stool designs
A research project involving two different designs of stool
Tested by 40 people, 20 people per stool
Each person was assigned to assess one product, providing an overall performance score

Create an error bar chart
Open the file TwoStools.spv
Graphs > Legacy Dialogs > Error Bar
Click on Define
Put PerformanceScore as the Variable and Design as the Category Axis
Click OK
Go to the output window

Interpretation: confidence intervals of the means of the performance scores
The circles are the sample means
We are 95% confident that the population means lie between the whiskers
As the intervals overlap, we should suspect the test will come back negative (an informal check, not failsafe!)
Also observe that the intervals are roughly equal in length

Robustness
Parameter-based statistical tests make certain assumptions in their underlying models
However, they often work well in other situations where these assumptions are violated
This is known as robustness
Robustness conditions depend upon the test being used
There are different opinions on robustness conditions

Assumption checking
Parametric tests are more sensitive than nonparametric tests but require certain assumptions to hold
Thus we need to check these assumptions first
Normality checking is not required with this test for equal group sizes of at least 25, due to robustness exceptions (Sawilowsky and Blair, 1992)
Here our groups were equal but only of size 20, so we need to test for normality
For small sample sizes the best test is Shapiro-Wilk
Reference: Sawilowsky, S. S. and Blair, R. C. (1992) A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111(2).

Assumption checking in SPSS
Analyze > Descriptive Statistics > Explore
Put PerformanceScore in the Dependent List and Design in the Factor List, and select Plots
Deselect Stem-and-leaf, then select Histogram and Normality plots with tests

Add a fitted normal curve to the histograms
Double click on the first histogram in the output window; this opens the Chart Editor window
Select the distribution curve button
Close the Properties window and the Chart Editor window
Repeat with the other histogram

The Design 1 histogram appears to be approximately normally distributed
The Design 2 histogram appears to be a bit skewed to the right. However, its skewness is less than twice its standard error.

The null and alternative hypotheses
Statistical testing is about making a decision about the significance of a data feature or summary statistic
We usually assume that the feature was just a random event, then seek to measure how unlikely such an event was
The statement of this position is known as the null hypothesis and is written H0
In statistical testing we decide whether to retain or reject the null hypothesis based on the probability (or P-) value of the test statistic
The logical opposite of the null hypothesis is known as the alternative hypothesis

Standard significance levels and the null hypothesis (H0)
Each row gives: P-value of test statistic; significant?; formal action; informal interpretation; example
P > 0.1; no; retain H0; no evidence to reject H0; Chris Froome
0.05 < P < 0.1; no; retain H0; weak evidence to reject H0; Plebgate libel trial
0.01 < P < 0.05; yes, at 95%; reject H0 at 95% confidence; evidence to reject H0; climate change
0.001 < P < 0.01; yes, at 99%; reject H0 at 99% confidence; strong evidence to reject H0; Plebgate police trial
P < 0.001; yes, at 99.9%; reject H0 at 99.9% confidence; very strong evidence to reject H0; Higgs boson

The Shapiro-Wilk test is negative for both designs, as the Sig. (or probability) values are both > 0.05
Therefore we can use the appropriate parametric test (the independent samples t-test)
Both normality tests reported by SPSS (Kolmogorov-Smirnov and Shapiro-Wilk) are not very sensitive with small sample sizes and are over-sensitive with larger samples (e.g. > 100)
For large samples, the probability values should be interpreted alongside the histograms with fitted normal curves and the Q-Q plots (see the normality checking sheet)
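The standard significance levels tabulated above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and the treatment of values that fall exactly on a boundary (such as P = 0.05) is an assumption, since the table does not say which row they belong to.

```python
def interpret_p_value(p):
    """Map a P-value to the informal interpretation used in the table.

    Boundary handling (e.g. p exactly 0.05) is an assumption of this sketch.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p < 0.001:
        return "Very strong evidence to reject H0 (significant at 99.9%)"
    if p < 0.01:
        return "Strong evidence to reject H0 (significant at 99%)"
    if p < 0.05:
        return "Evidence to reject H0 (significant at 95%)"
    if p < 0.1:
        return "Weak evidence to reject H0 (not significant, retain H0)"
    return "No evidence to reject H0 (not significant, retain H0)"


# Illustrative values only
for p in (0.2, 0.07, 0.013, 0.004, 0.0002):
    print(p, "->", interpret_p_value(p))
```

Note that the formal decision changes only at 0.05: the "weak evidence" row still retains H0.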

The independent samples t-test
Applies to different (independent) subjects with one scale-based data value each
Tests the difference between the means of the two samples
The samples can be different sizes
Here: product scores for Designs 1 and 2
Assumes normality
Null hypothesis (H0): the means of the performance scores for the two designs are equal
Two variants, depending upon whether the variances of the two designs can be assumed to be equal (use Levene's test first; H0: variances are equal); all done together in SPSS

Analyze > Compare Means > Independent Samples T-Test
Add PerformanceScore as the Test Variable and Design as the Grouping Variable
Select Define Groups and enter 1 for Group 1 and 2 for Group 2
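To make the two variants of the test concrete, the sketch below computes the t statistic and degrees of freedom both ways: with a pooled variance (equal variances assumed) and with the Welch-Satterthwaite approximation (equal variances not assumed). The function names and the sample scores are hypothetical, and no P-value is computed because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """t statistic and degrees of freedom, assuming equal variances."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / standard_error, df


def welch_t(a, b):
    """t statistic and approximate degrees of freedom, equal variances not assumed."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up scores, not the TwoStools data
design_1 = [1, 2, 3, 4, 5]
design_2 = [2, 4, 6, 8, 10]
print("Equal variances assumed:    ", pooled_t(design_1, design_2))
print("Equal variances not assumed:", welch_t(design_1, design_2))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ, which is why the two rows of the SPSS output are often so similar.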

SPSS automatically computes Levene's test and outputs both versions of the t-test
Levene's test: not significant at 95% (so we retain H0), so equal variances can be assumed (now look at the equal variances row)
t-test: significant at 95% (P between 0.05 and 0.01)
Interpretation: there is evidence that the mean performance scores for the stool designs are different (this differs from our informal interpretation of the error bar chart)

Nonparametric testing
A type of statistical inference which does not assume that the data come from a particular distribution
Often applies to category-based data (nominal and ordinal) but can also apply to scale-based data if test assumptions are not met
Advantage: no need to check assumptions
Disadvantages:
  Results are generally less sensitive (higher p-values)
  Cannot handle more complex data structures (such as two-way ANOVA)
The appropriate test here is the Mann-Whitney U test

Mann-Whitney U test
A nonparametric test of two independent samples of ordinal or scale-based data
Tests whether there is an increasing/decreasing relationship between two samples
Needs at least about 10 data categories for ordinal variables; otherwise use the Chi-squared test
Alternative to an independent samples t-test for scale-based data if the assumptions are not met (not the case here; shown for illustration purposes only)
Samples can be different sizes
Null hypothesis: Design 2 performance scores are equally likely to be higher or lower than Design 1 performance scores

Running the Mann-Whitney U test
Select Analyze > Nonparametric Tests > Independent Samples
On the Fields tab, add PerformanceScore to the Test Fields list and Design to the Groups list
Select Run
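The null hypothesis above has a direct counting interpretation, sketched below: compare every Design 1 score with every Design 2 score and count how often each design "wins", with ties counting one half. The resulting counts are the two Mann-Whitney U statistics. The function name and the scores are hypothetical, and the sketch stops at the statistic rather than reproducing the Sig. value SPSS reports.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y): the number of (x, y) pairs won by each sample.

    A tied pair contributes one half to each count.
    """
    u_x = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u_x += 1.0
            elif xi == yi:
                u_x += 0.5
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Made-up scores, not the TwoStools data
design_1 = [52, 61, 58, 70, 66]
design_2 = [60, 73, 68, 75, 64]
u_1, u_2 = mann_whitney_u(design_1, design_2)
share = u_2 / (len(design_1) * len(design_2))
print("U for Design 1:", u_1, " U for Design 2:", u_2)
print("Share of pairs in which Design 2 scores higher:", share)
```

Under the null hypothesis this share should be close to 0.5; a share near 0 or 1 indicates that one design tends to score higher than the other.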

The correct test has been run
The Sig. value is about the same as the value for the independent samples t-test (we expected it to be higher)
The output helpfully states the null hypothesis decision
It unhelpfully states the default significance level used (which can be misleading)

The data analysis process for several independent groups
Descriptive statistics, then assumption checking
  Pass: parametric testing
  Fail: nonparametric testing
If there are significant differences: post hoc testing

Example 2: 3 stool designs
A research project involving three different designs of a new product
Tested by 60 people, 20 people per product
Each person was assigned to assess one product, providing an overall performance score
Open the file ThreeStools.spv
Create descriptive statistics and an error bar chart as before

What is one-way ANOVA?
An extension of t-tests to several groups
Usually independent measures
Accounts for variations both within and between groups
Chart: 95% confidence intervals for 3 groups of measurements
These confidence intervals do not overlap, but does this mean we can conclude they are not all from the same population?

Initial observations
There appear to be differences between the sample means, i.e. variation between groups
But there is also variation within groups
Can we conclude that there are differences between groups (i.e. that they come from populations with different means)?
We need a systematic, objective approach: this is known as ANOVA
Called ANOVA from ANalysis Of VAriance (the name is a bit confusing because it sounds like a variance test, not a means test)

Introduction to ANOVA
Better than doing lots of two-sample tests, e.g. 6 groups would require 15 two-sample tests
For every test, there is a 0.05 probability that we reject H0 when it should be retained (assuming H0 is true)
Doing several tests increases the probability of making a wrong inference of significance (Type I error)
E.g. the probability of at least one Type I error with 6 groups, assuming they are all equally randomly distributed, is $1 - 0.95^{15} = 1 - 0.463 = 0.537$, i.e. more than 1 in 2
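The arithmetic behind the 6-group example can be checked with a few lines. This sketch assumes, as the calculation on the slide implicitly does, that the pairwise tests are independent and each uses a 0.05 significance level; the function name is hypothetical.

```python
from math import comb


def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, probability of at least one Type I error).

    Assumes every pairwise test is independent and H0 is true for all of them.
    """
    n_tests = comb(n_groups, 2)
    return n_tests, 1 - (1 - alpha) ** n_tests


for k in (3, 6):
    n_tests, rate = familywise_error_rate(k)
    print(f"{k} groups: {n_tests} pairwise tests, familywise error rate {rate:.3f}")
```

Even with only 3 groups (3 pairwise tests) the rate is already well above 0.05, which is the motivation for a single overall test followed by adjusted post hoc comparisons.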

The ANOVA model
$y_{ij} = \mu + m_i + e_{ij}$
$y_{ij}$ denotes the performance score for the $j$th measurement of the $i$th design
The parameter $m_i$ denotes how the performance score for design $i$ differs from the overall mean $\mu$
$e_{ij}$ denotes the error (or residual) for the $j$th measurement of the $i$th design
The ANOVA model assumes that all these errors are normally distributed with zero mean and equal variances

Test hypothesis
In our example, we need to test the hypothesis $H_0: m_1 = m_2 = m_3 = 0$
Or, more simply, that the product score population means are the same
Intuitively, this is done by looking at the difference between means relative to the difference between observations, i.e. is the mean-to-mean variation greater than what you would expect by chance?
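The idea of comparing mean-to-mean variation with the variation between observations can be sketched as the usual F ratio: the between-group mean square divided by the within-group mean square. The function name and the data are hypothetical, and the P-value lookup that SPSS performs is omitted.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the overall mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Made-up scores, not the ThreeStools data
designs = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(one_way_anova_f(designs))
```

A large F means the group means are further apart than the within-group scatter would explain by chance.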

Assumptions
(Similar to the independent samples t-test assumptions)
1. The measurements for each group are normally distributed. However, if there are many groups there is a danger of Type I errors.
2. The errors for the whole data set are normally distributed (this theoretically follows from Assumption 1, but it is worth testing separately with small samples). To calculate these errors we first need to estimate the group means.
3. The variances of each group are equal (we can still use a version of ANOVA even if this one fails)

Assumption 1: Check normality of each group
No evidence that individual groups are not normally distributed

Assumption 2: Testing errors for normality
First create the residuals
Select Analyze > General Linear Model > Univariate
Add the variables as shown
Select Save
Choose Unstandardised residuals (these are based on estimates of $m_i$)

Select Analyze > Descriptive Statistics > Explore
Add the residual variable as shown, but with no factor
Select Plots, then Histogram and Normality plots with tests as before
Then add a normal curve to the histogram as before
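The saved residuals are simply each score minus the mean of its own group. A minimal sketch of that calculation, with a hypothetical function name and made-up scores:

```python
from statistics import mean


def group_residuals(scores_by_design):
    """Return each score minus its own group mean, pooled into one list."""
    residuals = []
    for scores in scores_by_design.values():
        group_mean = mean(scores)
        residuals.extend(score - group_mean for score in scores)
    return residuals


# Made-up scores, not the ThreeStools data
scores = {1: [55, 60, 65], 2: [62, 70, 78], 3: [71, 74, 80]}
print(group_residuals(scores))
```

The pooled list is what gets tested for normality in the Explore step, which is why no factor is used there.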

Evidence that the residuals are not normally distributed from the Shapiro-Wilk test (p < 0.05), even though the degrees of freedom have been reduced slightly
The Kolmogorov-Smirnov test is even more significant
Kurtosis (peakedness) looks a bit high
Formally, we should compare the absolute value of the kurtosis with twice its standard error: this is significant, as it is higher

Assumption 3: Equal variances
Analyze > Compare Means > One-Way ANOVA
Add PerformanceScore to the Dependent List and Design as the Factor
Select Options and Homogeneity of variance test
This carries out Levene's test for homogeneity of variance (similar to the t-test)
Null hypothesis: the variances are equal
Significant at 95% (p-value < 0.05), so we have evidence to reject the assumption of equality of variances
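For intuition, the classic form of Levene's test is a one-way ANOVA on the absolute deviations of each score from its group mean. The sketch below computes that statistic only. The function names and data are hypothetical, and it is an assumption that this mean-centred variant matches the exact figure a given SPSS version reports.

```python
from statistics import mean


def one_way_f(groups):
    """Between-group mean square divided by within-group mean square."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def levene_statistic(groups):
    """Mean-centred Levene statistic: F ratio of the absolute deviations."""
    deviations = [[abs(v - mean(g)) for v in g] for g in groups]
    return one_way_f(deviations)


# Made-up scores whose spread grows from group to group
designs = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
print(levene_statistic(designs))
```

A large statistic means the typical distance from the group mean differs between groups, i.e. the variances look unequal.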

Robustness of ANOVA
ANOVA is quite robust to changes in skewness but not to changes in kurtosis. Thus, it should not be used when |kurtosis| > 2 × standard error of kurtosis for any group or for the errors.
Otherwise, provided the group sizes are equal and there are at least 20 degrees of freedom, ANOVA is quite robust to violations of its assumptions
However, the variances must still be equal
Source: Field, A. (2013) Discovering Statistics using SPSS. 4th edn. London: SAGE.

Robustness calculation
Table columns: Group; Kurtosis; Standard Error of Kurtosis; Condition met
Design 1: condition met
Design 2: condition met
Design 3: condition met
Errors: condition not met
Group sizes are equal
Total degrees of freedom = 60 − 1 = 59 > 20
Also, standard ANOVA cannot be used because the variances are not equal
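The twice-the-standard-error rule is easy to apply by hand, but it can also be sketched in code. The standard errors below use the usual sample-size-only formulas; that these are the formulas behind the SPSS output is an assumption, and the kurtosis values in the example are invented for illustration, not taken from the workshop data.

```python
from math import sqrt


def se_skewness(n):
    """Standard error of sample skewness for a sample of size n (n > 2)."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for a sample of size n (n > 3)."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def exceeds_twice_se(statistic, standard_error):
    """True when |statistic| is more than twice its standard error."""
    return abs(statistic) > 2 * standard_error


# Hypothetical kurtosis values: one group of 20 and the pooled errors of 60
for label, kurtosis, n in (("group", 0.9, 20), ("errors", 1.5, 60)):
    se = se_kurtosis(n)
    print(f"{label}: SE = {se:.3f}, too high: {exceeds_twice_se(kurtosis, se)}")
```

Because the standard error shrinks as n grows, the same kurtosis can pass for a single group of 20 and fail for the pooled errors of 60.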

Summary of findings: ANOVA assumptions
1. Normality of groups: no evidence of non-normality
2. Normality of errors: evidence of non-normality
3. Equality of variances: evidence of non-equality
Robustness: kurtosis of errors too high

One-way ANOVA
If all 3 assumptions (or the robustness exceptions to non-normality) are OK, then use standard one-way ANOVA
Analyze > Compare Means > One-Way ANOVA
Under Options select Descriptive
Shown for illustration purposes only

Significance level < 0.001
So there is very strong evidence of differences in performance score between the three designs

What if these assumptions are in doubt?
If the normality assumptions (or their robust exceptions) are in doubt:
  Use a nonparametric test: Kruskal-Wallis or median if there is no trend in the groups, or Jonckheere-Terpstra if you are looking for a trend (e.g. mean of group 1 < mean of group 2 < mean of group 3, etc.)
  Available under Analyze > Nonparametric Tests > Independent Samples
If the equality of variances assumption is in doubt:
  Use the Brown-Forsythe or Welch test
  Select ANOVA, click on the Options button and select the Brown-Forsythe and Welch options

Nonparametric one-way ANOVA
We should use the Kruskal-Wallis or median tests, as there is no trend to observe between these designs
The median test is cruder than Kruskal-Wallis and should only be preferred when ranges of extreme values have been summarised together, which was not the case here
Select Analyze > Nonparametric Tests > Independent Samples
Add PerformanceScore as the Test Field and Design as the Groups variable on the Fields tab
On the Settings tab, select Customize tests and the Kruskal-Wallis test
Then select Run

Returns a significance value < 0.001 (ignore the note below the result, as before)
Very strong evidence that there are differences between the groups (as before)
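As a rough illustration of what the Kruskal-Wallis test computes, the sketch below ranks all scores together, sums the ranks within each group and forms the H statistic with the standard correction for ties. Function names and data are hypothetical. The P-value, which comes from a chi-squared distribution with (number of groups − 1) degrees of freedom, is not computed here.

```python
def midranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return (tie-corrected H statistic, degrees of freedom)."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical, so H is undefined")
    return h / correction, len(groups) - 1


# Made-up scores, not the ThreeStools data
print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because only the ranks are used, the statistic is unaffected by the non-normal errors and heavy tails that ruled out standard ANOVA.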

ANOVA with unequal variances
Our data set violated the normality of errors assumption, but there were also differences in variances
The Brown-Forsythe and Welch tests should only be used with unequal variances if the data and errors are normally distributed (shown for illustration purposes only here)
Under Options select the Brown-Forsythe and Welch tests

Both tests are again significant at 99.9%
Very strong evidence that the means are not equal
Generally the Welch test is slightly better, unless there is one group with an extreme mean and a large variance (which was not the case here, so the Welch test should be preferred); see (Field, 2013: 443)

Multiple comparisons
What if we conclude there are differences between the groups?
We don't know which pairs are different
We can do post hoc tests to compare each pair of groups
These are similar to 2-sample tests but with significance levels adjusted for the multiple testing issue
Note: you should only run post hoc tests if you obtain a positive result from the ANOVA (or equivalent) test

Which post hoc test?
For equal group sizes and similar variances, use Tukey (HSD) or REGWQ, or for guaranteed control over Type I errors (more conservative), use Bonferroni
For slightly different group sizes, use Gabriel
For very different group sizes, use Hochberg's GT2
For unequal variances, use Games-Howell (also recommended as a backup in other circumstances)
Source: (Field, 2013: 459)

Our data set violated the normality of errors assumption, but there were also significant differences in variances
Try using the Games-Howell post hoc test (shown for illustration purposes only)
Run the One-Way ANOVA as before
Select Post Hoc and Games-Howell

Very strong evidence of differences between groups 1 and 3
Evidence of differences between groups 1 and 2
Weak evidence of differences between groups 2 and 3

Nonparametric post hoc testing
This is the correct post hoc testing for our data set
Double click on the Kruskal-Wallis output box in the output window; this opens the Model Viewer window
Change the View to Pairwise Comparisons

The output should then look like this. Concentrate on the Adjusted Sig. values:
Weak evidence of a difference between Design 1 and Design 2
Very strong evidence of a difference between Design 1 and Design 3
No evidence of a difference between Design 2 and Design 3

Note
SPSS version 22 does not use the Mann-Whitney U test in its Kruskal-Wallis post hoc testing but a variant called the Dunn-Bonferroni test
The Sig. values given by the pairwise comparison in the Model Viewer are higher than those for the Mann-Whitney U test (e.g. the value for Designs 1 and 2 was found earlier to be 0.013; note: one more decimal place is needed to calculate the correction)
However, we can still use their relative size to decide for which pairs to run an individual post hoc test using the Mann-Whitney U test
For our data set, we do not need to run Designs 2 and 3, because we know the result will be non-significant even with a correction for this bug, but we should run Designs 1 and 3
To obtain the adjusted Sig. values, multiply the Sig. value by the number of pairs
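The adjustment described in the last step can be sketched as follows, using the Sig. value of 0.013 quoted above for Designs 1 and 2 and the 3 possible pairs among 3 designs. The function name is hypothetical, and capping the result at 1 is an assumption added so that the adjusted value remains a valid probability.

```python
from math import comb


def bonferroni_adjust(sig, n_groups):
    """Multiply a pairwise Sig. value by the number of pairs, capped at 1."""
    n_pairs = comb(n_groups, 2)
    return min(1.0, sig * n_pairs)


# 0.013 is the Mann-Whitney U Sig. value quoted for Designs 1 and 2
print(round(bonferroni_adjust(0.013, 3), 3))
```

The rounding in the quoted Sig. value is multiplied too, which is why the note asks for one more decimal place before applying the correction.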

Legacy Mann-Whitney U test
The Mann-Whitney U test will not work in the new dialog with three groups
Use the legacy dialog instead: Analyze > Nonparametric Tests > Legacy Dialogs > 2 Independent Samples
Choose groups 1 and 3
Mann-Whitney U is the default test
Double click on the Exact Sig. output to check it to one more decimal place

Example 2: Summary of results
Table columns: Pair; Games-Howell; Kruskal-Wallis (Dunn-Bonferroni); pairwise Mann-Whitney U with Bonferroni adjustment
Table rows: 1 and 2; 1 and 3; 2 and 3 (the pairwise Mann-Whitney U test was not run for Designs 2 and 3)
According to the preferred (second and third) tests there is:
  Very strong evidence of a difference between Designs 1 and 3
  (Weak) evidence of a difference between Designs 1 and 2
  No evidence of a difference between Designs 2 and 3

Recap
We have considered:
Two groups:
  Descriptives
  Assumption checking (for parametric tests)
  Independent samples t-test
  Mann-Whitney U test
Several groups:
  Descriptives
  Assumption checking (for parametric tests)
  One-way ANOVA
  Kruskal-Wallis test
  Post hoc testing

statstutor resources
Normality checking (draft electronic copy provided)
Normality checking solutions (draft electronic copy provided)
Independent samples t-test (paper copy provided)
Mann-Whitney U test (available from statstutor website)
One way ANOVA (paper copy provided)
One way ANOVA additional material (available from statstutor website)
Kruskal-Wallis test (draft electronic copy provided)

References
IBM (2014) Post hoc comparisons for the Kruskal-Wallis test. ibm.com/support/docview.wss?uid=swg
IBM developerWorks (2015) Bonferroni with Mann-Whitney? ml/topic?id= ad0-4f26-9a ac4f
Field, A. (2013) Discovering Statistics using SPSS: (And sex and drugs and rock 'n' roll). 4th edn. London: SAGE.
Sawilowsky, S. S. and Blair, R. C. (1992) A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111(2).
Statistica (n.d.) Statistica Help: Nonparametric Statistics Notes: Kruskal-Wallis ANOVA by Ranks and Median Test.