Mann-Whitney U 2 Sample Test (a.k.a. Wilcoxon Rank Sum Test)

Non-Parametric Univariate Statistics: Wilcoxon-Mann-Whitney 2 Sample Test

The (Wilcoxon-) Mann-Whitney (WMW) test is the non-parametric equivalent of a pooled 2-sample t-test. The test assumes you have two independent samples from two populations, and that the samples have the same shapes and spreads, though they don't have to be symmetric. The WMW procedure is a statistical test of the difference between the two medians ($\eta_1$ and $\eta_2$) under the null hypothesis that they are equal. Like the other non-parametric tests we have seen so far, the WMW test works on ranked data.

The basic procedure is incredibly simple. Combine the two samples into one column, rank the data from smallest to largest (where 1 = smallest), break them down into their original samples, and sum up the total rank scores ($W$) of each. If the null hypothesis is true then you would expect the two final rank sums to be about equal (allowing for any difference in sample size); the larger the difference between the two scores, the more likely that the difference is real. To test for significance we calculate an expected score:

$$E(W) = \frac{n(N + 1)}{2} \qquad (1)$$

where $E(W)$ is the expectation of $W$, $n$ is the sample size of the sample being tested, and $N$ is the total sample size, $N = n_1 + n_2$. It turns out that the difference between the observed and expected rank sums is best approximated through the use of a normal distribution; the area under the curve of a z-distribution. The numerator of the z score is as usual, but the denominator is more complex. After a bunch of tedious algebra it turns out to be:

$$z = \frac{W - E(W)}{\sqrt{n_1 n_2 (N + 1) / 12}} \qquad (2)$$

The resulting z score is then looked up in a table as usual, remembering to adjust for one or two tails.

The WMW test establishes confidence intervals around the median of the differences between the two test samples, called the point estimate. This is not so easy to do by hand, as first you would need to calculate the point estimate, and then establish a confidence level as close to the 95% level as possible through non-linear interpolation. So, let MINITAB do it.
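The "tedious algebra" behind the denominator of the z score can be sketched in a few lines. This is the standard textbook argument rather than anything taken from this handout, and the symbols $R_i$ (the rank of the $i$-th observation in sample 1) and $\sigma^2$ (the variance of a single rank) are introduced here only for illustration. Under the null hypothesis, and assuming no ties, the $n_1$ ranks in sample 1 behave like a draw without replacement from the integers $1, \dots, N$:

```latex
\begin{align*}
E(R_i) &= \frac{N + 1}{2}, \qquad
\sigma^2 = \operatorname{Var}(R_i) = \frac{N^2 - 1}{12} \\[4pt]
E(W) &= n_1 \, E(R_i) = \frac{n_1 (N + 1)}{2} \\[4pt]
\operatorname{Var}(W) &= n_1 \sigma^2 \cdot \frac{N - n_1}{N - 1}
  = n_1 \cdot \frac{(N - 1)(N + 1)}{12} \cdot \frac{n_2}{N - 1}
  = \frac{n_1 n_2 (N + 1)}{12}
\end{align*}
```

The factor $(N - n_1)/(N - 1)$ is the usual finite-population correction for sampling without replacement, and $N - n_1 = n_2$. Taking the square root of $\operatorname{Var}(W)$ gives the denominator of the z score.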
Let's work through an example. The ethnohistoric indigenous peoples of the west coast of North America maintained a hunting and gathering lifestyle, but one based primarily on predictable aquatic resources. As such, these hunter-gatherer groups were much less mobile than most hunter-gatherer populations, setting up seasonal permanent villages and developing a very complex hierarchical social structure. Although the groups along the west coast shared similar cultural traits, those to the north were generally more sedentary and complex than those to the south. Binford (2002) includes a variety of data on such groups. For this test we are interested in whether there is a significant difference between the mean annual population aggregations of groups along the Alaska coast (n = 13) and the California coast (n = 12).

Therefore, let $\eta_A$ = the median population aggregation of Alaska coastal groups, and let $\eta_C$ = the median population aggregation of California coastal groups. Formally, we wish to test the following null hypothesis at the $\alpha$ = 0.05 level (95% confidence):

$$H_0: \eta_A - \eta_C = 0$$
$$H_A: \eta_A - \eta_C \neq 0$$

The data are as follows (number of individuals):

Alaska (n = 13)    California (n = 12)
197                50.5
162                50
57                 557
108                42
53.5               23
55                 26
77                 45
39                 96
66                 113
48                 30
121                33
79                 45
309

To check the assumption of similar spreads and shapes we run the descriptive statistics and produce a boxplot:

Descriptive Statistics

Variable      N    Mean   Median  Tr Mean  StDev  SE Mean
Alaska       13   105.5     77.0     93.0   77.3     21.4
California   12    92.5     45.0     53.1  148.8     43.0

Variable      Min    Max    Q1     Q3
Alaska       39.0  309.0  54.3  141.5
California   23.0  557.0  30.8   84.6

We see the mean does not equal the median, and looking at the boxplot we see both distributions are heavily skewed to the right. While the spreads are a little different, they are close enough for our purposes and we can conclude that both assumptions of the WMW are met.
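For readers without MINITAB, the descriptive statistics can be approximated with a short Python sketch using only the standard library. The function name `describe` and the dictionary layout are illustrative choices, not part of the original handout. The quartiles use Python's default "exclusive" method, which interpolates at position p(n + 1); assuming MINITAB uses the same rule, this should reproduce its Q1 and Q3 up to rounding. The trimmed mean and SE mean are left out.

```python
import statistics

# Data from the text (number of individuals per group)
alaska = [197, 162, 57, 108, 53.5, 55, 77, 39, 66, 48, 121, 79, 309]
california = [50.5, 50, 557, 42, 23, 26, 45, 96, 113, 30, 33, 45]


def describe(sample):
    """Return a few of the summary statistics shown in the table."""
    # quantiles(..., n=4) returns [Q1, median, Q3]; the default
    # "exclusive" method interpolates at position p * (n + 1).
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation (n - 1)
        "min": min(sample),
        "max": max(sample),
        "q1": q1,
        "q3": q3,
    }


summary = {"Alaska": describe(alaska), "California": describe(california)}

for name, values in summary.items():
    print(name, {key: round(value, 2) for key, value in values.items()})
```

In both samples the mean sits well above the median, which is the numerical counterpart of the right skew seen in the boxplot.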

[Figure: boxplots of the Alaska and California samples; vertical axis from 0 to 600]

Below are the calculations step by step for the hypothesis test:

1       2           3         4           5     6           7       8
Alaska  California  Combined  Factor      Rank  Factor      Alaska  California
197     50.5        50.5      California  11    California  23      11
162     50          50        California  10    California  22      10
57      557         557       California  25    California  14      25
108     42          42        California  6     California  19      6
53.5    23          23        California  1     California  12      1
55      26          26        California  2     California  13      2
77      45          45        California  7     California  16      7
39      96          96        California  18    California  5       18
66      113         113       California  20    California  15      20
48      30          30        California  3     California  9       3
121     33          33        California  4     California  21      4
79      45          45        California  7     California  17      7
309                 197       Alaska      23    Alaska      24
                    162       Alaska      22    Alaska
                    57        Alaska      14    Alaska
                    108       Alaska      19    Alaska
                    53.5      Alaska      12    Alaska
                    55        Alaska      13    Alaska
                    77        Alaska      16    Alaska
                    39        Alaska      5     Alaska
                    66        Alaska      15    Alaska
                    48        Alaska      9     Alaska
                    121       Alaska      21    Alaska
                    79        Alaska      17    Alaska
                    309       Alaska      24    Alaska

Columns 1 and 2: The raw data
Column 3: The two columns combined into one
Column 4: Each observation is labeled by a factor so that we can keep track of where it belongs
Column 5: Column 3 ranked from smallest to largest
Column 6: The factor again
Columns 7 and 8: The samples are recombined and separated into their original groups

The rank sum $W_A$ of the Alaska sample is 23+22+14+19+12+13+16+5+15+9+21+17+24 = 210

The rank sum $W_C$ of the California sample is 11+10+25+6+1+2+7+18+20+3+4+7 = 114

(The two tied California values of 45 are both given rank 7 here. Assigning each the average rank of 7.5 would make the California sum 115, but it leaves $W_A$, and so the test below, unchanged.)

As $W_A + W_C = T$, the total of all the ranks, we can go ahead and choose one rank sum to work with, as they both will give the same result. We will use $W_A$. Our expected value is given by equation 1:

$$E(W_A) = \frac{n_A (N + 1)}{2} = \frac{13(25 + 1)}{2} = 169$$

The z score is calculated using equation 2:

$$z = \frac{W_A - E(W_A)}{\sqrt{n_1 n_2 (N + 1) / 12}} = \frac{210 - 169}{\sqrt{13 \times 12 \times (25 + 1) / 12}} = 2.23 \text{ s.d. units}$$

The 2-tailed probability associated with 2.23 s.d. units under the normal curve is p = 0.026. As our $p < \alpha$ we reject the null hypothesis at the 95% level in favor of the alternative that, in fact, there is a statistically significant difference between the mean annual population aggregations of groups along the Alaska coast and those along the California coast. As the rank sum for the California sample is much less than that of the Alaska sample, we could further conclude that on average population aggregations are larger in Alaska.

To look at the confidence limits we need to run the test through MINITAB.

>STAT
>NON-PARAMETRICS
>MANN-WHITNEY
>Put Alaska as the FIRST SAMPLE and California as the SECOND
>Leave the CONFIDENCE LEVEL as 95%
>Leave the ALTERNATIVE as NOT EQUAL
>OK

The output looks as follows:

Mann-Whitney Confidence Interval and Test

Alaska      N = 13   Median = 77.0
California  N = 12   Median = 45.0
Point estimate for ETA1-ETA2 is 27.0
95.3 Percent CI for ETA1-ETA2 is (4.5,75.0)
W = 210.0
Test of ETA1 = ETA2 vs ETA1 not = ETA2 is significant at 0.0276
The test is significant at 0.0276 (adjusted for ties)

The first thing to notice is that MINITAB gives us both the hypothesis test and the confidence limits without us having to run the test twice and select different options (I don't know why). The significance value MINITAB comes up with is p = 0.0276, which is slightly higher than our hand calculation (p = 0.026), but not enough to make any difference to the outcome.

For the confidence limits we see the point estimate = 27, that is, the estimated median of the difference between the two samples. We see MINITAB could not find us a confidence level of exactly 95% but achieved a level of 95.3%. The lower bound is 4.5 and the upper is 75, and as they do not encompass the hypothesized value of zero, we agree with the hypothesis test and reject the null hypothesis at the $\alpha$ = 0.05 level in favor of the alternative.
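The hand calculation and parts of the MINITAB output can be cross-checked with a short Python sketch (standard library only). It is an illustration under stated assumptions, not a reimplementation of MINITAB. The helper names `average_ranks` and `two_tailed_p` are made up for this example. Tied values receive their average rank, so the two California values of 45 each get 7.5 and the California rank sum comes out as 115, while the Alaska sum is still 210. The 0.5 continuity correction is an assumption about why MINITAB's p-value is slightly larger than the hand value; the handout does not say so. The point estimate is taken to be the median of all pairwise Alaska minus California differences, and the sketch does not attempt the confidence interval or the tie adjustment.

```python
import math
import statistics

alaska = [197, 162, 57, 108, 53.5, 55, 77, 39, 66, 48, 121, 79, 309]
california = [50.5, 50, 557, 42, 23, 26, 45, 96, 113, 30, 33, 45]


def average_ranks(values):
    """Rank values from smallest (1) to largest; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def two_tailed_p(z_score):
    """Two-tailed area under the standard normal curve beyond |z|."""
    return math.erfc(abs(z_score) / math.sqrt(2))


n1, n2 = len(alaska), len(california)
N = n1 + n2

# Combine, rank, then split the ranks back into the original samples.
ranks = average_ranks(alaska + california)
w_alaska = sum(ranks[:n1])
w_california = sum(ranks[n1:])

expected = n1 * (N + 1) / 2             # equation 1
sd = math.sqrt(n1 * n2 * (N + 1) / 12)  # denominator of equation 2
z = (w_alaska - expected) / sd          # equation 2
p_hand = two_tailed_p(z)

# ASSUMPTION: a 0.5 continuity correction, which would account for the
# slightly larger p-value reported by MINITAB.
z_cc = (abs(w_alaska - expected) - 0.5) / sd
p_cc = two_tailed_p(z_cc)

# ASSUMPTION: point estimate = median of all n1 * n2 pairwise differences.
differences = [a - c for a in alaska for c in california]
point_estimate = statistics.median(differences)

print(f"W_A = {w_alaska}, W_C = {w_california}, E(W_A) = {expected}")
print(f"z = {z:.2f}, p = {p_hand:.4f}, with continuity correction p = {p_cc:.4f}")
print(f"point estimate = {point_estimate}")
```

Under these assumptions the uncorrected p-value rounds to the hand value of 0.026, the corrected one lands close to MINITAB's 0.0276, and the median pairwise difference matches the reported point estimate of 27.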