Multivariate Tests of Means in Independent Groups Designs: Effects of Covariance Heterogeneity and Non-Normality

Lisa M. Lix and H. J. Keselman
University of Manitoba

Running head: MULTIVARIATE TESTS IN INDEPENDENT GROUPS DESIGNS

Author Contact Information: Lisa Lix, Department of Community Health Sciences, Faculty of Medicine, McDermot Avenue, University of Manitoba, Winnipeg, MB R3E 3P5

Biographical Information: Lisa Lix is Assistant Professor, Department of Community Health Sciences, University of Manitoba, Winnipeg, Canada. Her research interests are in the areas of longitudinal data analysis, multivariate methods, and robust estimation and testing. Her current publications are found in British Journal of Mathematical and Statistical Psychology, Multivariate Behavioral Research, Psychophysiology, and Journal of Community Health and Epidemiology.

Harvey Keselman is Professor of Psychology, University of Manitoba, Winnipeg, Canada. His areas of interest include the analysis of repeated measurements, multiple comparison procedures, and robust estimation and testing. His publications have appeared in journals such as British Journal of Mathematical and Statistical Psychology, Educational and Psychological Measurement, Journal of Educational and Behavioral Statistics, Psychological Methods, Psychometrika, Psychophysiology, and Statistics in Medicine.

Abstract

Health evaluation research often employs multivariate designs in which data on several outcome variables are obtained for independent groups of subjects. This article examines statistical procedures for testing hypotheses of multivariate mean equality in two-group designs. The conventional test for multivariate means, Hotelling's $T^2$, rests on certain assumptions about the distribution of the data and about the population variances and covariances. When these assumptions are violated, which is often the case in applied health research, $T^2$ can lead to invalid conclusions about the null hypothesis. We describe procedures that are robust, or insensitive, to assumption violations. A numeric example illustrates the statistical concepts that are presented, and a computer program to implement these robust solutions is introduced.

Health evaluation research frequently involves the collection of multiple outcome measurements on two or more groups of subjects. For example, Harasym et al. (1996) obtained scores on multiple personal traits using the Myers-Briggs inventory for nursing students with three different styles of learning. Knapp and Miller (1983) discuss the measurement of multiple dimensions of healthcare quality as part of a system-wide evaluation program. In the simplest case where there are only two groups, such as a case and a control group, Hotelling's (1931) $T^2$ is the traditional method for testing that the means for the set of outcome variables are equivalent across groups (i.e., the hypothesis of multivariate mean equality). The $T^2$ statistic is the analogue of Student's two-group t statistic for testing equality of group means for a single outcome variable.

While Hotelling's $T^2$ is the most common choice in the multivariate context, most applied researchers are unaware that it rests on a set of derivational assumptions that are not likely to be satisfied in evaluation research. Specifically, this test assumes that the outcome measurements follow a multivariate normal distribution and exhibit a common covariance structure. Multivariate normality is the assumption that all outcome variables and all combinations of the variables are normally distributed. The assumption of a common covariance matrix is that the populations will exhibit the same variances and covariances for all of the outcome variables. For example, with two groups and three dependent variables, the assumption of covariance homogeneity is that the three variances and three covariances are equivalent for the two populations from which the data were sampled.
The $T^2$ test is not robust to assumption violations, meaning that it is sensitive to changes in those factors which are extraneous to the hypothesis of interest (Ito, 1980). In fact, this test may become seriously biased when assumptions are not satisfied, resulting in spurious decisions about the null hypothesis. Moreover, the assumptions of normality and covariance homogeneity are not likely to be satisfied in practice. Outliers or extreme observations are often a significant concern in evaluation research (see, e.g., Sharmer, 2001). Furthermore, subjects who are exposed to a particular healthcare treatment or intervention may exhibit greater variability on the outcome measures than subjects who are not exposed to it (see, e.g., Grissom, 2000; Hill & Dixon, 1982; Hoover, 2002). Consequently, researchers who rely on Hotelling's (1931) $T^2$ procedure to test hypotheses about equality of multivariate group means may unwittingly be filling the literature with non-replicable results, or at other times may fail to detect intervention effects when they are present. This should be of significant concern to health evaluation researchers because the results of statistical tests are routinely used to make decisions about the effectiveness of clinical interventions and to plan healthcare program content and delivery. In this era of evidence-based decision making, it is important to ensure that the statistical procedures that are applied to evaluation data will produce valid results.

Within the last 50 years, a number of statistical procedures that are robust to violations of the assumption of covariance homogeneity have been proposed in the literature. However, these procedures are largely unknown to applied researchers and therefore are not likely to be adopted in practice. Moreover, all of these procedures are sensitive to departures from multivariate normality. Recent research shows that it is possible to obtain a test that is robust to the combined effects of covariance heterogeneity and non-normality.
This involves substituting robust measures of location and scale for the usual mean and covariances in tests that are insensitive to covariance heterogeneity. These robust measures are less affected by the presence of outlying scores or skewed distributions than traditional measures.

The purpose of this paper is to introduce health evaluation researchers to both the concepts and the applications of robust test procedures for multivariate data. This paper begins with an introduction to the statistical notation that will be helpful in understanding the concepts. This is followed by a discussion of procedures that can be used to test the hypothesis of multivariate mean equality when statistical assumptions are and are not satisfied. We will then show how to obtain a test that is robust to the combined effects of covariance heterogeneity and multivariate non-normality. Throughout this presentation, a numeric example will help to illustrate the concepts and computations. Finally, we demonstrate a computer program that can be used to implement the statistical tests described in this paper. Robust procedures are largely inaccessible to applied researchers because they have not yet been incorporated into extant statistical software packages. The program that we introduce will be beneficial to evaluation researchers who want to test hypotheses of mean equality in multivariate designs but are concerned that their data may violate the assumptions which underlie conventional methods of analysis.

Statistical Notation

Consider the case of a single outcome variable. Let $X_{ij}$ represent the measurement on that outcome variable for the ith subject in the jth group ($i = 1, \ldots, n_j$; $j = 1, 2$). Under a normal theory model, it is assumed that $X_{ij}$ follows a normal distribution with mean $\mu_j$ and variance $\sigma_j^2$. When there are only two groups of subjects, the null hypothesis is

$$H_0: \mu_1 = \mu_2. \qquad (1)$$

In other words, one wishes to test whether the population means for the outcome variable are equivalent.

To generalize to the multivariate context, assume that we have measurements for each subject on p outcome variables. In other words, instead of just a single value, we now have a set of p values for each subject. Using matrix notation, $\mathbf{X}_{ij}$ represents the vector (i.e., row) of p outcome measurements for the ith subject in the jth group, that is, $\mathbf{X}_{ij} = [X_{ij1} \ldots X_{ijp}]$. For example, $\mathbf{X}_{ij}$ may represent the values for a series of measures of physical function or attitudes towards a healthcare intervention. It is assumed that $\mathbf{X}_{ij}$ follows a multivariate normal distribution with mean $\boldsymbol{\mu}_j$ and variance-covariance matrix $\boldsymbol{\Sigma}_j$ (i.e., $\mathbf{X}_{ij} \sim N[\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j]$). The vector $\boldsymbol{\mu}_j$ contains the mean scores on each outcome variable, that is, $\boldsymbol{\mu}_j = [\mu_{j1} \ldots \mu_{jp}]$. The variance-covariance matrix, $\boldsymbol{\Sigma}_j$, is a p x p matrix with the variances for each outcome variable on the diagonal and the covariances for all pairs of outcome variables on the off diagonal,

$$\boldsymbol{\Sigma}_j = \begin{bmatrix} \sigma_{j1}^2 & \cdots & \sigma_{j1p} \\ \vdots & \ddots & \vdots \\ \sigma_{j1p} & \cdots & \sigma_{jp}^2 \end{bmatrix}.$$

The null hypothesis,

$$H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2, \qquad (2)$$

is used to test whether the means for the set of p outcome variables are equal across the two groups. As Knapp and Miller (1983) observe, adopting a test of multivariate (i.e., joint) equivalence is preferable to adopting multiple tests of univariate equivalence, particularly when the outcome variables are correlated. Type I errors, which are erroneous conclusions about true null hypotheses, may occur when multiple univariate tests are performed. If the outcome variables are independent, the probability that at least one erroneous decision will be made on the set of p outcome variables is $1 - (1 - \alpha)^p$, which is approximately equal to $\alpha p$ when $\alpha$ is small, where $\alpha$ is the nominal level of significance. For example, with $\alpha = .05$ and p = 3, the probability of making at least one erroneous decision is $1 - (.95)^3 = .14$.

To illustrate the multivariate concepts that have been presented to this point, we will use the example data set of Table 1. These data are for two groups of subjects and two outcome variables. Let $n_j$ represent the sample size for the jth group. The example data are for an unbalanced design (i.e., unequal group sizes), where $n_1 = 6$ and $n_2 = 8$. The vector of scores for the first subject of group 1 is $\mathbf{X}_{11} = [1 \;\; 51]$, the vector for the second subject of group 1 is $\mathbf{X}_{21} = [28 \;\; 48]$, and so on.

Let $\bar{\mathbf{X}}_j$ and $\mathbf{S}_j$ represent the sample mean vector and sample covariance matrix for the jth group. Table 2 contains these summary statistics for the example data set. The mean scores for the first outcome variable are 13.0 and 19.4 for groups 1 and 2, respectively. For the second outcome variable, the corresponding means are 49.7 and 47.8. The variances for groups 1 and 2 on the first outcome measure are 74.8 and 2.3, respectively. The larger variance for the first group is primarily due to the presence of two extreme values of 1 and 28 for the first and second subjects, respectively. The corresponding variances on the second outcome variable are 3.9 and 3.6. For group 1, the covariance for the two outcome variables is -7.0.

The population correlation for two variables q and q' (i.e., $\rho_{qq'}$; $q, q' = 1, \ldots, p$) can be obtained from the covariance and the variances,

$$\rho_{qq'} = \frac{\sigma_{qq'}}{\sqrt{\sigma_q^2 \, \sigma_{q'}^2}},$$

where $\sigma_{qq'}$ is the covariance and $\sigma_q^2$ is the variance for the qth outcome variable. The sample correlation coefficient, $r_{qq'}$, is used to estimate the population correlation coefficient. In the example data set of Table 1, a moderate negative correlation of $r_{qq'} = -.4$ exists for the two variables for group 1.

Tests for Mean Equality when Assumptions are Satisfied

Student's t statistic is the conventional procedure for testing the null hypothesis of equality of population means in a univariate design (i.e., equation 1). The test statistic, which assumes equality of population variances, is

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}, \qquad (3)$$

where $\bar{X}_j$ represents the mean for the jth group and $s^2$, the variance that is pooled for the two groups, is

$$s^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}, \qquad (4)$$

where $s_j^2$ is the variance for the jth group.

The multivariate Hotelling's (1931) $T^2$ statistic is formed from equation 3 by replacing means with mean vectors and the pooled variance with the pooled covariance matrix,

$$T^2 = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \mathbf{S} \right]^{-1} (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)^{\mathsf{T}}, \qquad (5)$$

where $\mathsf{T}$ is the transpose operator, which is used to convert the row vector, $(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$, to a column vector, $-1$ denotes the inverse of a matrix, and $\mathbf{S}$ is the pooled sample covariance matrix,

$$\mathbf{S} = \frac{(n_1 - 1) \mathbf{S}_1 + (n_2 - 1) \mathbf{S}_2}{n_1 + n_2 - 2}. \qquad (6)$$
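The pooled covariance matrix and the Hotelling statistic of equations 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: it handles only p = 2 outcome variables (so that the matrix inverse can be written in closed form), the function names are hypothetical, and the two small groups of scores are invented for the example rather than taken from Table 1.

```python
def mean_vector(rows):
    """Column means of a list of equal-length score vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def inverse_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def hotelling_t2(group1, group2):
    """Two-group Hotelling T^2 for p = 2 outcomes (equations 5 and 6)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = covariance_matrix(group1), covariance_matrix(group2)
    # Equation 6: pooled covariance matrix.
    pooled = [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (n1 + n2 - 2)
               for b in range(2)] for a in range(2)]
    # Equation 5: quadratic form in the mean difference vector.
    scale = 1 / n1 + 1 / n2
    inv = inverse_2x2([[scale * v for v in row] for row in pooled])
    d = [m1[k] - m2[k] for k in range(2)]
    return sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))


# Invented scores: three subjects per group, two outcome variables each.
group1 = [(1, 1), (2, 3), (3, 2)]
group2 = [(4, 3), (5, 5), (6, 4)]
t2 = hotelling_t2(group1, group2)
print(round(t2, 4))
```

For these invented scores both groups have the same covariance matrix, and working equation 5 by hand gives a value of 14, which the sketch reproduces. A general implementation for p > 2 would replace `inverse_2x2` with a general matrix inverse.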

This test statistic is easily obtained from standard software packages such as SAS (SAS Institute, 1999b). Statistical significance of this $T^2$ statistic is evaluated using the $T^2$ distribution. The test statistic can also be converted to an F statistic,

$$F_T = \frac{N - p - 1}{p(N - 2)} \, T^2, \qquad (7)$$

where $N = n_1 + n_2$. Statistical significance is then assessed by comparing the $F_T$ statistic to its critical value $F[p, N - p - 1]$, that is, a critical value from the F distribution with p and N - p - 1 degrees of freedom (df).

When the data are sampled from populations that follow a normal distribution but have unequal covariance matrices (i.e., $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$), Hotelling's (1931) $T^2$ will generally maintain the rate of Type I errors (i.e., the probability of rejecting a true null hypothesis) close to $\alpha$ if the design is balanced (i.e., $n_1 = n_2$; Christensen & Rencher, 1997; Hakstian, Roed, & Lind, 1979; Hopkins & Clay, 1963). However, when the design is unbalanced, either a liberal or a conservative test will result, depending on the nature of the relationship between the covariance matrices and group sizes. A liberal result is one in which the actual Type I error rate will exceed $\alpha$. Liberal results are problematic; researchers will be filling the literature with false positives (i.e., saying there are treatment effects when none are present). A conservative result, on the other hand, is one in which the Type I error rate will be less than $\alpha$. Conservative results are also a cause for concern because they may result in test procedures that have low statistical power to detect true differences in population means (i.e., real effects will be undetected).

If the group with the largest sample size also exhibits the smallest element values of $\boldsymbol{\Sigma}_j$, which is known as a negative pairing condition, the error will be liberal. For example, Hopkins and Clay (1963) showed that when group sizes were 10 and 20, and the ratio of the largest to the smallest standard deviations of the groups was 1.6, the true rate of Type I errors exceeded the nominal $\alpha = .05$. When the ratio of the group standard deviations was increased to 3.2, the Type I error rate was more than four times the nominal level of significance! For positive pairings of group sizes and covariance matrices, such that the group with the largest sample size also exhibits the largest element values of $\boldsymbol{\Sigma}_j$, the $T^2$ procedure tends to produce a conservative test. In fact, the error rate may be substantially below the nominal level of significance. For example, Hopkins and Clay observed Type I error rates below the nominal $\alpha = .05$ for positive pairings at both of the standard deviation ratios noted previously. These liberal and conservative results for normally distributed data have been demonstrated in a number of studies (Everitt, 1979; Hakstian et al., 1979; Holloway & Dunn, 1967; Hopkins & Clay, 1963; Ito & Schull, 1964; Zwick, 1986), for both moderate and large degrees of covariance heterogeneity.

When the assumption of multivariate normality is violated, the performance of Hotelling's (1931) $T^2$ test depends on both the degree of departure from a multivariate normal distribution and the nature of the research design. The earliest research on Hotelling's $T^2$ when the data are non-normal suggested that tests of the null hypothesis for two groups were relatively insensitive to departures from this assumption (e.g., Hopkins & Clay, 1963). This may be true when the data are only moderately non-normal. However, Everitt (1979) showed that this test procedure can become quite conservative when the distribution is skewed or when outliers are present in the tails of the distribution, particularly when the design is unbalanced (see also Zwick, 1986).

Tests for Mean Equality when Assumptions are not Satisfied

Both parametric and nonparametric alternatives to the $T^2$ test have been proposed in the literature.
Applied researchers often regard nonparametric procedures as appealing alternatives because they rely on rank scores, which are typically perceived as being easy to conceptualize and interpret. However, these procedures test hypotheses about equality of distributions rather than equality of means. They are therefore sensitive to covariance heterogeneity, because distributions with unequal variances will necessarily result in rejection of the null hypothesis. Zwick (1986) showed that nonparametric alternatives to Hotelling's (1931) $T^2$ could control the Type I error rate when the data were sampled from non-normal distributions and covariances were equal. Not surprisingly, when covariances were unequal, these procedures produced biased results, particularly when group sizes were unequal.

Covariance heterogeneity. There are several parametric alternatives to Hotelling's (1931) $T^2$ test. These include the Brown-Forsythe (Brown & Forsythe, 1974, [BF]), James (1954) first and second order (J1 & J2), Johansen (1980, [J]), Kim (1992, [K]), Nel and Van der Merwe (1986, [NV]), and Yao (1965, [Y]) procedures. The BF, J1, J2, and J procedures have also been generalized to multivariate designs containing more than two groups of subjects.

The J1, J2, J, NV, and Y procedures are all obtained from the same test statistic,

$$T^2 = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) \left( \frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2} \right)^{-1} (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)^{\mathsf{T}}. \qquad (8)$$

The J, NV, and Y procedures approximate the distribution of the statistic in equation 8 differently because the df for these three procedures are computed using different formulas. The J1 and J2 procedures each use a different critical value to assess statistical significance. However, they both rely on large-sample theory regarding the distribution of the test statistic in equation 8. What this means is that when sample sizes are sufficiently large, the statistic of equation 8 approximately follows a chi-squared ($\chi^2$) distribution. For both procedures, this test statistic is referred to an adjusted $\chi^2$ critical value. If the test statistic exceeds that critical value, the null hypothesis of equation 2 is rejected.
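The difference between the statistic of equation 8 and Hotelling's pooled statistic is only in how the group covariance matrices are combined: each matrix is divided by its own group size instead of being pooled. The following Python sketch illustrates this for p = 2 outcome variables. The function names and the two groups of scores are assumptions made for the example; the sketch computes only the test statistic and does not implement the df or critical-value adjustments that distinguish the J1, J2, J, NV, and Y procedures.

```python
def mean_vector(rows):
    """Column means of a list of equal-length score vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Unbiased sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def inverse_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def separate_covariance_t2(group1, group2):
    """Statistic of equation 8 for p = 2: no pooling of S1 and S2."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = covariance_matrix(group1), covariance_matrix(group2)
    # Each covariance matrix is weighted by the reciprocal of its group size.
    combined = [[s1[a][b] / n1 + s2[a][b] / n2 for b in range(2)]
                for a in range(2)]
    inv = inverse_2x2(combined)
    d = [m1[k] - m2[k] for k in range(2)]
    return sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))


# Invented scores: unequal group sizes and unequal covariance matrices.
group1 = [(1, 1), (2, 3), (3, 2)]
group2 = [(3, 4), (7, 4), (5, 2), (5, 6), (5, 4)]
t2 = separate_covariance_t2(group1, group2)
print(round(t2, 4))
```

When the two groups are the same size, the combined matrix in this sketch equals the scaled pooled matrix of Hotelling's statistic, so the two statistics coincide; they differ only in unbalanced designs such as the one above.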
The critical value for the J1 procedure is slightly smaller than the one for the J2 procedure. As a result, the J1 procedure generally produces larger Type I error rates than the J2 procedure, and therefore is not often recommended (de la Rey & Nel, 1993). While the J2 procedure may offer better Type I error control, it is computationally complex. The critical value for J2 is described in the Appendix, along with the F-statistic conversions and df computations for the J, NV, and Y procedures.

The K procedure is based on an F statistic. It is more complex than the preceding test procedures because eigenvalues and eigenvectors of the group covariance matrices must be computed. For completeness, the formula used to compute the K test statistic and its df are found in the Appendix.

The BF procedure (see also Mehrotra, 1997) relies on a test statistic that differs slightly from the one presented in equation 8,

$$T_{BF}^2 = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2) \left[ \left( 1 - \frac{n_1}{N} \right) \mathbf{S}_1 + \left( 1 - \frac{n_2}{N} \right) \mathbf{S}_2 \right]^{-1} (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)^{\mathsf{T}}. \qquad (9)$$

As one can see, the test statistic in equation 9 weights the group covariance matrices in a different way than the test statistic in equation 8. Again, for completeness, the numeric solutions for the BF F statistic and df are found in the Appendix.

Among the BF, J2, J, K, NV, and Y tests there appears to be no one best choice in all data-analytic situations when the data are normally distributed, although a comprehensive comparison of all of these procedures has not yet been conducted. Factors such as the degree of covariance heterogeneity, total sample size, the degree of imbalance of the group sizes, and the relationship between the group sizes and covariance matrices will determine which procedure will afford the best Type I error control and maximum statistical power to detect group differences. Christensen and Rencher (1997) noted, in their extensive comparison among the James, J, K, NV, and Y procedures, that the J and Y procedures could occasionally result in inflated Type I error rates for negative pairings of group sizes and covariance matrices. These liberal tendencies were exacerbated as the number of outcome variables increased. The authors recommended the K procedure overall, observing that it offered the greatest statistical power among those procedures that never produced inflated Type I error rates. However, the authors report a number of situations of covariance heterogeneity in which the K procedure could become quite conservative, with Type I error rates well below the nominal $\alpha = .05$ for some conditions with unequal group sizes.

For multivariate non-normal distributions, Algina, Oshima, and Tang (1991) showed that the J1, J2, J, and Y procedures could not control Type I error rates when the underlying population distributions were highly skewed. For a lognormal distribution, which has skewness of 6.18, they observed many instances in which empirical Type I error rates of all of these procedures were more than four times the nominal level of significance. Wilcox (1995) found that the J test produced excessive Type I errors when sample sizes were small (i.e., $n_1 = 12$ and $n_2 = 18$) and the data were generated from non-normal distributions; the K procedure became conservative when the skewness was 6.18. For larger group sizes (i.e., $n_1 = 24$ and $n_2 = 36$), the J procedure provided acceptable control of Type I errors when the data were only moderately non-normal, but for the maximum skewness considered, it also became conservative.

Fouladi and Yockey (2002) found that the degree of departure from a multivariate normal distribution was a less important predictor of Type I error performance than sample size. Across the range of conditions which they examined, the Y test produced the greatest average Type I error rates and the NV procedure the smallest.
Error rates were only slightly influenced by the degree of skewness or kurtosis of the data. However, these authors looked at only very modest departures from a normal distribution; the maximum degree of skewness considered was .75.

Non-normality. For univariate designs, a test procedure that is robust to the biasing effects of non-normality may be obtained by adopting estimators of location and scale that are insensitive to the presence of extreme scores and/or a skewed distribution (Keselman, Kowalchuk, & Lix, 1998; Lix & Keselman, 1998). There are a number of robust estimators that have been proposed in the literature; among these, the trimmed mean has received a great deal of attention because of its good theoretical properties, ease of computation, and ease of interpretation (Wilcox, 1995a). The trimmed mean is obtained by removing (i.e., censoring) the most extreme scores in the distribution. Hence, one removes the effects of the most extreme scores, which have the tendency to shift the mean in their direction.

One should recognize at the outset that while robust estimators are insensitive to departures from a normal distribution, they test a different null hypothesis than least-squares estimators. The null hypothesis is about equality of trimmed population means. In other words, one is testing a hypothesis that focuses on the bulk of the population, rather than the entire population. Thus, if one subscribes to the position that inferences pertaining to robust parameters are more valid than inferences pertaining to the usual least-squares parameters, then procedures based on robust estimators should be adopted.

To illustrate the computation of the trimmed mean, let $X_{(1)j} \le X_{(2)j} \le \ldots \le X_{(n_j)j}$ represent the ordered observations for the jth group on a single outcome variable. In other words, one begins by ordering the observations for each group from smallest to largest. Then let $g_j = [\gamma n_j]$, where $\gamma$ represents the proportion of observations that are to be trimmed in each tail of the distribution and $[x]$ is the greatest integer $\le x$. The effective sample size for the jth group is defined as $h_j = n_j - 2g_j$. The sample trimmed mean,

$$\bar{X}_{tj} = \frac{1}{h_j} \sum_{i = g_j + 1}^{n_j - g_j} X_{(i)j}, \qquad (10)$$

is computed by censoring the $g_j$ smallest and the $g_j$ largest observations. The most extreme scores for each group of subjects are trimmed independently of the extreme scores for all other groups. A fixed proportion of the observations is trimmed from each tail of the distribution; 20 percent trimming is generally recommended (Wilcox, 1995a).

The Winsorized variance is the theoretically correct measure of scale that corresponds to the trimmed mean (Yuen, 1974) and is used to obtain the diagonal elements of the group covariance matrix. To obtain the Winsorized variance, the sample Winsorized mean is first computed,

$$\bar{X}_{wj} = \frac{1}{n_j} \sum_{i=1}^{n_j} Z_{ij}, \qquad (11)$$

where

$$Z_{ij} = \begin{cases} X_{(g_j + 1)j} & \text{if } X_{ij} \le X_{(g_j + 1)j} \\ X_{ij} & \text{if } X_{(g_j + 1)j} < X_{ij} < X_{(n_j - g_j)j} \\ X_{(n_j - g_j)j} & \text{if } X_{ij} \ge X_{(n_j - g_j)j}. \end{cases}$$

The Winsorized mean is obtained by replacing the $g_j$ smallest values with the next most extreme value, and the $g_j$ largest values with the next most extreme value. The Winsorized variance for the jth group on a single outcome variable, $s_{wj}^2$, is

$$s_{wj}^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} \left( Z_{ij} - \bar{X}_{wj} \right)^2, \qquad (12)$$

and the standard error of the trimmed mean is $\sqrt{(n_j - 1) s_{wj}^2 / [h_j (h_j - 1)]}$. The Winsorized covariance for the outcome variables q and q' ($q, q' = 1, \ldots, p$) is

$$s_{wjqq'} = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} \left( Z_{ijq} - \bar{X}_{wjq} \right) \left( Z_{ijq'} - \bar{X}_{wjq'} \right), \qquad (13)$$

and the Winsorized covariance matrix for the jth group is

$$\mathbf{S}_{wj} = \begin{bmatrix} s_{wj1}^2 & \cdots & s_{wj1p} \\ \vdots & \ddots & \vdots \\ s_{wj1p} & \cdots & s_{wjp}^2 \end{bmatrix}.$$

To illustrate, we return to the data set of Table 1. For the first outcome variable for the first group, the ordered observations are 1, 11, 12, 13, 13, and 28, and with 20% trimming, $g_1 = [6 \times .20] = 1$. The scores of 1 and 28 are removed and the mean of the remaining scores is computed, which produces 12.2. Table 2 contains the vectors of trimmed means for the two groups. To Winsorize the data set for the first group on the first outcome variable, the largest and smallest values in the set of ordered observations are replaced by the next most extreme scores, producing the following set of ordered observations: 11, 11, 12, 13, 13, 13.

The Winsorized mean, $\bar{X}_{w1}$, is 12.2. While in this example the Winsorized mean has the same value as the trimmed mean, these two estimators will not, as a rule, produce an equivalent result. Table 2 contains the Winsorized covariance matrices for the two groups.

A test which is robust to the biasing effects of both multivariate non-normality and covariance heterogeneity can be obtained by using one of the BF, J2, J, K, NV, or Y test procedures, and substituting the trimmed means and the Winsorized covariance matrix for the least-squares mean and covariance matrix (see Wilcox, 1995b). For example, with robust estimators, the $T^2$ statistic of equation 5 becomes

$$T_t^2 = (\bar{\mathbf{X}}_{t1} - \bar{\mathbf{X}}_{t2}) \left[ \left( \frac{1}{h_1} + \frac{1}{h_2} \right) \mathbf{S}_w^* \right]^{-1} (\bar{\mathbf{X}}_{t1} - \bar{\mathbf{X}}_{t2})^{\mathsf{T}}, \qquad (14)$$

where

$$\mathbf{S}_w^* = \frac{(n_1 - 1) \mathbf{S}_{w1} + (n_2 - 1) \mathbf{S}_{w2}}{h_1 + h_2 - 2}. \qquad (15)$$

Wilcox (1995b) compared the K and J procedures when trimmed means and Winsorized covariances were substituted for the usual estimators and the data followed a multivariate non-normal distribution. The Type I error performance of the J procedure with robust estimators was similar to that of the K procedure when sample sizes were sufficiently large (i.e., $n_1 = 24$ and $n_2 = 36$). More importantly, however, there was a dramatic improvement in power when the test procedures with robust estimators were compared to their least-squares counterparts; this was observed both for heavy-tailed (i.e., extreme values in the tails) and skewed distributions. The differences in power were as great as 60 percentage points, which represents a substantial difference in the ability to detect outcome effects.
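The trimming and Winsorizing steps for a single outcome variable can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the six scores are illustrative values chosen to resemble the first-group example (an assumption, since the original table is not reproduced here), and the small constant added before taking the integer part is a guard against floating-point round-off that is not part of the formulas in the text.

```python
import math


def trim_count(n, gamma):
    """g = [gamma * n], the number of scores trimmed from each tail."""
    return int(math.floor(gamma * n + 1e-9))  # guard against round-off


def trimmed_mean(scores, gamma=0.20):
    """Mean of the scores left after removing g from each tail."""
    ordered = sorted(scores)
    n = len(ordered)
    g = trim_count(n, gamma)
    kept = ordered[g:n - g]
    return sum(kept) / len(kept)  # len(kept) is h = n - 2g


def winsorize(scores, gamma=0.20):
    """Replace the g smallest and g largest scores by their neighbours."""
    ordered = sorted(scores)
    n = len(ordered)
    g = trim_count(n, gamma)
    low, high = ordered[g], ordered[n - g - 1]
    return [min(max(x, low), high) for x in scores]


def winsorized_variance(scores, gamma=0.20):
    """Sample variance (divisor n - 1) of the Winsorized scores."""
    z = winsorize(scores, gamma)
    n = len(z)
    mean_w = sum(z) / n
    return sum((v - mean_w) ** 2 for v in z) / (n - 1)


def trimmed_mean_se(scores, gamma=0.20):
    """Standard error of the trimmed mean: sqrt((n-1) s_w^2 / (h (h-1)))."""
    n = len(scores)
    h = n - 2 * trim_count(n, gamma)
    return math.sqrt((n - 1) * winsorized_variance(scores, gamma) / (h * (h - 1)))


# Illustrative scores with one extreme value in each tail.
scores = [1, 28, 12, 13, 13, 11]
print(trimmed_mean(scores))          # mean of 11, 12, 13, 13
print(sorted(winsorize(scores)))     # 1 becomes 11, 28 becomes 13
print(winsorized_variance(scores))
print(trimmed_mean_se(scores))
```

With 20% trimming and six scores, one score is removed from each tail, so the trimmed mean is based on the four central scores, while the Winsorized variance is still computed over all six (modified) scores. Extending the sketch to the Winsorized covariance would Winsorize each outcome variable separately and then form cross-products as in equation 13.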

Computer Program to Obtain Numeric Solutions

Appendix B contains a module of programming code that will produce numeric results using least-squares and robust estimators with the test procedures enumerated previously, that is, the BF, J2, J, K, NV, and Y procedures. The module is written in the SAS language (SAS Institute Inc., 1999a). The IML (Interactive Matrix Language) component of SAS is required to run this program. This program can be used with either the PC or UNIX versions of SAS; it was generated using SAS version 8. The program can be downloaded from Lisa Lix's website.

In order to run the program, the data set, group sizes, proportion of trimming, and nominal level of significance, $\alpha$, must be input. It is assumed that the data set is complete, so that there are no missing values for any of the subjects on the outcome variables. The program generates as output the summary statistics for each group (i.e., means and covariance matrices). For each test procedure, the relevant $T^2$ and/or F statistics are produced along with the numerator ($\nu_1$) and denominator ($\nu_2$) df for the F statistic, and either a p-value or critical value. These results can be produced for both least-squares estimators and robust estimators with separate calls to the program.

To produce results for the example data of Table 1 with least-squares estimators, the following data input lines are required.

    X={ 5, 8 48, 49, 3 5, 3 5, 47, 9 46, 8 48, 8 50, 50, 9 45, 0 46, 48, 8 49};
    NX={6 8};
    PTRIM=0;
    ALPHA=.05;
    RUN TMULT;
    QUIT;

The first line is used to specify the data set, X. Notice that a comma separates the series of measurements for each subject and braces enclose the data set. The next line of code specifies the group sizes. Again, braces enclose the element values. No comma is required to separate the two elements. The next line of code specifies PTRIM, the proportion of trimming that will occur in each tail of the distribution. If PTRIM=0, then no observations are trimmed or Winsorized. If PTRIM > 0, then the proportion specified is the proportion of observations that are trimmed/Winsorized. To produce the recommended 20% trimming, PTRIM=.20. Note that a symmetric trimming approach is automatically assumed in the program; trimming proportions for the right and left tails are not specified separately. The RUN TMULT code invokes the program and generates output. Observe that each line of code ends with a semi-colon. Also, it is necessary that these lines of code follow the FINISH statement that concludes the program module.

Table 3 contains the output produced by the SAS/IML program for each test statistic for the example data set. For comparative purposes the program produces the results for Hotelling's (1931) $T^2$. We do not recommend, however, that the results for this procedure be reported. The output for least-squares estimators is provided first. A second invocation of the program with PTRIM=.20 is required to produce the results for robust estimators. As noted previously, the program will output a $T^2$ statistic and/or an F statistic, along with the df, and p-value or critical value. This information is used to either reject or fail to reject the null hypothesis.

As Table 3 reveals, when least-squares estimators are adopted, all of the test procedures fail to reject the null hypothesis of equality of multivariate means. One would conclude that there is no difference between the two groups on the multivariate means.
However, when robust estimators are adopted, all of the procedures result in rejection of the null hypothesis of equality of multivariate trimmed means, leading to the conclusion that the two groups do differ on the multivariate means. These results demonstrate the influence that a small number of extreme observations can have on tests of mean equality in multivariate designs.

Conclusions and Recommendations

Although Hunter and Schmidt (1995) argue against the use of tests of statistical significance, their observation that "methods of data analysis used in research have a major effect on research progress" (p. 45) is certainly valid in the current discussion. Recent advances in data-analytic techniques for multivariate data are unknown to the majority of applied health researchers. Traditional procedures for testing multivariate hypotheses of mean equality make specific assumptions concerning the data distribution and the group variances and covariances. Valid tests of hypotheses of healthcare intervention effects are obtained only when the assumptions underlying tests of statistical significance are satisfied. If these assumptions are not satisfied, erroneous conclusions regarding the nature or presence of intervention effects may be made.

In this article, we have reviewed the shortcomings of Hotelling's (1931) $T^2$ test and described a number of procedures that are insensitive to the assumption of equality of population covariance matrices for multivariate data. Substituting robust estimators for the usual least-squares estimators will result in test procedures that are insensitive to both covariance heterogeneity and multivariate non-normality. Robust estimators are measures of location and scale that are less influenced by the presence of extreme scores in the tails of a distribution. Robust estimators based on the concepts of trimming and Winsorizing result in the most extreme scores either being removed or replaced by less extreme scores. To facilitate the adoption of the robust test procedures by applied researchers, we have presented a computer program that can be used to obtain robust solutions for multivariate two-group data.

The choice among the Brown-Forsythe (1974), James (1954) second order, Johansen (1980), Kim (1992), Nel and van der Merwe (1986), and Yao (1965) procedures with robust estimators will depend on the characteristics of the data, such as the number of dependent variables, the nature of the relationship between group sizes and covariance matrices, and the degree of inequality of population covariance matrices. Current knowledge suggests that the Kim (1992) procedure may be among the best choices (Wilcox, 1995b), because it does not result in liberal or conservative tests under many data-analytic conditions and provides good statistical power to detect between-group differences on multiple outcome variables. Further research is needed, however, to provide more specific recommendations regarding the performance of these six procedures when robust estimators are adopted.

Finally, we would like to note that the majority of the procedures that have been described in this paper can be generalized to the case of more than two independent groups (see, e.g., Coombs & Algina, 1996). Thus, applied health researchers have the opportunity to adopt robust test procedures for a variety of multivariate data-analytic situations.

References

Algina, J., Oshima, T. C., & Tang, K. L. (1991). Robustness of Yao's, James', and Johansen's tests under variance-covariance heteroscedasticity and nonnormality. Journal of Educational Statistics, 6,
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 6,
Christensen, W. F., & Rencher, A. C. (1997). A comparison of Type I error rates and power levels for seven solutions to the multivariate Behrens-Fisher problem. Communications in Statistics - Simulation and Computation, 6,
Coombs, W. T., & Algina, J. (1996). New test statistics for MANOVA/descriptive discriminant analysis. Educational and Psychological Measurement, 56,
de la Rey, N., & Nel, D. G. (1993). A comparison of the significance levels and power functions of several solutions to the multivariate Behrens-Fisher problem. South African Statistical Journal, 7,
Everitt, B. S. (1979). A Monte Carlo investigation of the robustness of Hotelling's one- and two-sample T² tests. Journal of the American Statistical Association, 74,
Fouladi, R. T., & Yockey, R. D. (2002). Type I error control of two-group multivariate tests on means under conditions of heterogeneous correlation structure and varied multivariate distributions. Communications in Statistics - Simulation and Computation, 3,
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68,
Hakstian, A. R., Roed, J. C., & Lind, J. C. (1979). Two-sample T² procedure and the assumption of homogeneous covariance matrices. Psychological Bulletin, 56,
Harasym, P. H., Leong, E. J., Lucier, G. E., & Lorscheider, F. L. (1996). Relationship between Myers-Briggs psychological traits and use of course objectives in anatomy and physiology. Evaluation & the Health Professions, 9,
Hill, M. A., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38,
Holloway, L. N., & Dunn, O. J. (1967). The robustness of Hotelling's T². Journal of the American Statistical Association, 6,
Hoover, D. R. (00). Clinical trials of behavioural interventions with heterogeneous teaching subgroup effects. Statistics in Medicine, 30,
Hopkins, J. W., & Clay, P. P. F. (1963). Some empirical distributions of bivariate T² and homoscedasticity criterion M under unequal variance and leptokurtosis. Journal of the American Statistical Association, 58,
Hotelling, H. (1931). The generalization of Student's ratio. Annals of Mathematical Statistics,,
Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah (Ed.), Handbook of Statistics, Vol. (pp ). North-Holland: New York.
Ito, K., & Schull, W. J. (1964). On the robustness of the T_0^2 test in multivariate analysis of variance when variance-covariance matrices are not equal. Biometrika, 5, 7-8.
James, G. S. (1954). Tests of linear hypotheses in univariate and multivariate analysis when the ratios of population variances are unknown. Biometrika, 4,
Johansen, S. (1980). The Welch-James approximation to the distribution of the residual sum of squares in a weighted linear regression. Biometrika, 67,
Keselman, H. J., Kowalchuk, R. K., & Lix, L. M. (1998). Robust nonorthogonal analyses revisited: An update based on trimmed means. Psychometrika, 63,
Kim, S. J. (1992). A practical solution to the multivariate Behrens-Fisher problem. Biometrika, 79,
Knapp, R. G., & Miller, M. C. (1983). Monitoring simultaneously two or more indices of health care. Evaluation & the Health Professions, 6,
Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of mean equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58,
Mehrotra, D. V. (1997). Improving the Brown-Forsythe solution to the generalized Behrens-Fisher problem. Communications in Statistics - Simulation and Computation, 6,
Nel, D. G., & van der Merwe, C. A. (1986). A solution to the multivariate Behrens-Fisher problem. Communications in Statistics - Simulation and Computation, 5,
SAS Institute Inc. (1999a). SAS/IML user's guide, Version 8. Author: Cary, NC.
SAS Institute Inc. (1999b). SAS/STAT user's guide, Version 8. Author: Cary, NC.
Schmidt, F., & Hunter, J. E. (1995). The impact of data-analysis methods on cumulative research knowledge. Evaluation & the Health Professions, 8,
Sharmer, L. (2001). Evaluation of alcohol education programs on attitude, knowledge, and self-reported behavior of college students. Evaluation & the Health Professions, 4,
Vallejo, G., Fidalgo, A., & Fernandez, P. (2001). Effects of covariance heterogeneity on three procedures for analyzing multivariate repeated measures designs. Multivariate Behavioral Research, 36, -7.
Wilcox, R. R. (1995a). ANOVA: A paradigm for low power and misleading measures of effect size? Review of Educational Research, 65, 5-77.
Wilcox, R. R. (1995b). Simulation results on solutions to the multivariate Behrens-Fisher problem via trimmed means. The Statistician, 44, 3-5.
Yao, Y. (1965). An approximate degrees of freedom solution to the multivariate Behrens-Fisher problem. Biometrika, 5,
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. Biometrika, 6,
Zwick, R. (1986). Rank and normal scores alternatives to Hotelling's T². Multivariate Behavioral Research,,

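Appendix

Numeric Formulas for Alternatives to Hotelling's (1931) T² Test

Brown and Forsythe (1974)

The numeric formulas presented here are based on the work of Brown and Forsythe, with the modifications to the df calculations suggested by Mehrotra (1997; see also Vallejo, Fidalgo, & Fernandez, 2001). Let $w_j = n_j/N$ and $w_j^* = 1 - w_j$. Then

$$F_{BF} = \frac{\nu_{BF2}}{p f}\, T_{BF}, \tag{A1}$$

where $\nu_{BF2} = f - p + 1$, $T_{BF}$ is given in equation 9, and

$$f = \frac{\operatorname{tr}^2(\mathbf{G}_1) + \operatorname{tr}(\mathbf{G}_1^2)}{\dfrac{1}{n_1 - 1}\left[\operatorname{tr}^2(w_1^* \mathbf{S}_1) + \operatorname{tr}\{(w_1^* \mathbf{S}_1)^2\}\right] + \dfrac{1}{n_2 - 1}\left[\operatorname{tr}^2(w_2^* \mathbf{S}_2) + \operatorname{tr}\{(w_2^* \mathbf{S}_2)^2\}\right]}. \tag{A2}$$

In equation A2, tr denotes the trace of a matrix, and $\mathbf{G}_1 = w_1^* \mathbf{S}_1 + w_2^* \mathbf{S}_2$. The test statistic $F_{BF}$ is compared to the critical value $F[\nu_{BF1}, \nu_{BF2}]$, where

$$\nu_{BF1} = \frac{\operatorname{tr}^2(\mathbf{G}_1) + \operatorname{tr}(\mathbf{G}_1^2)}{\operatorname{tr}^2(\mathbf{G}_2) + \operatorname{tr}(\mathbf{G}_2^2) + \sum_{j=1}^{2} (w_j^* - w_j)\left[\operatorname{tr}^2(\mathbf{S}_j) + \operatorname{tr}(\mathbf{S}_j^2)\right]}, \tag{A3}$$

and $\mathbf{G}_2 = w_1 \mathbf{S}_1 + w_2 \mathbf{S}_2$.

James (1954) Second Order

The test statistic $T_2$ of equation 8 is compared to the critical value $\chi^2_{p;\alpha}(A + \chi^2_{p;\alpha} B) + q$, where $\chi^2_{p;\alpha}$ is the $1 - \alpha$ percentile point of the $\chi^2$ distribution with $p$ df,

$$A = 1 + \frac{1}{2p}\left[\frac{\operatorname{tr}^2(\mathbf{A}^{-1}\mathbf{A}_1)}{n_1 - 1} + \frac{\operatorname{tr}^2(\mathbf{A}^{-1}\mathbf{A}_2)}{n_2 - 1}\right], \tag{A4}$$

$\mathbf{A}_j = \mathbf{S}_j/n_j$, $\mathbf{A} = \mathbf{A}_1 + \mathbf{A}_2$, and

$$B = \frac{1}{p(p + 2)}\left[\frac{\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_1\mathbf{A}^{-1}\mathbf{A}_1) + \tfrac{1}{2}\operatorname{tr}^2(\mathbf{A}^{-1}\mathbf{A}_1)}{n_1 - 1} + \frac{\operatorname{tr}(\mathbf{A}^{-1}\mathbf{A}_2\mathbf{A}^{-1}\mathbf{A}_2) + \tfrac{1}{2}\operatorname{tr}^2(\mathbf{A}^{-1}\mathbf{A}_2)}{n_2 - 1}\right]. \tag{A5}$$

The constant $q$ is based on a lengthy formula that has not been reproduced here; it can be found in equation 6.7 of James (1954).

Johansen (1980)

Let $F_J = T_2/c$, where $c = p + 2C - 6C/(p + 2)$ and

$$C = \sum_{j=1}^{2} \frac{1}{2(n_j - 1)}\left[\operatorname{tr}\{(\mathbf{A}^{-1}\mathbf{A}_j)^2\} + \operatorname{tr}^2(\mathbf{A}^{-1}\mathbf{A}_j)\right]. \tag{A6}$$

The test statistic $F_J$ is compared to the critical value $F[p, \nu_J]$, where $\nu_J = p(p + 2)/(3C)$.

Kim (1992)

The K procedure is based on the test statistic

$$F_K = \frac{\nu_K\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\,\mathbf{V}^{-1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)}{c\,m\,f}, \tag{A7}$$

where

$$\mathbf{V} = \mathbf{A}_1 + r^2\mathbf{A}_2 + 2r\,\mathbf{A}_2^{1/2}\left(\mathbf{A}_2^{-1/2}\mathbf{A}_1\mathbf{A}_2^{-1/2}\right)^{1/2}\mathbf{A}_2^{1/2},$$

$$c = \frac{\sum_{l=1}^{p} h_l^2}{\sum_{l=1}^{p} h_l}, \tag{A8}$$

$$m = \frac{\left(\sum_{l=1}^{p} h_l\right)^2}{\sum_{l=1}^{p} h_l^2}, \tag{A9}$$

$h_l = (d_l + 1)/(d_l^{1/2} + r)^2$, where $d_l$ is the $l$th eigenvalue of $\mathbf{A}_1\mathbf{A}_2^{-1}$, $r = |\mathbf{A}_1\mathbf{A}_2^{-1}|^{1/(2p)}$, and $|\cdot|$ is the determinant. The test statistic $F_K$ is compared to the critical value $F[m, \nu_K]$, where $\nu_K = f - p + 1$,

$$f = \frac{T_2^{\,2}}{\sum_{j=1}^{2} b_j^2/(n_j - 1)}, \tag{A10}$$

and $b_j = (\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)^{\mathsf{T}}\,\mathbf{V}^{-1}\mathbf{A}_j\mathbf{V}^{-1}\,(\bar{\mathbf{Y}}_1 - \bar{\mathbf{Y}}_2)$.

Nel and van der Merwe (1986)

Let

$$F_{NV} = \frac{\nu_N}{p f}\, T_2, \tag{A11}$$

where $\nu_N = f - p + 1$ and

$$f = \frac{\operatorname{tr}(\mathbf{A}^2) + \operatorname{tr}^2(\mathbf{A})}{\sum_{j=1}^{2} \dfrac{1}{n_j - 1}\left[\operatorname{tr}(\mathbf{A}_j^2) + \operatorname{tr}^2(\mathbf{A}_j)\right]}. \tag{A12}$$

The $F_{NV}$ statistic is compared to the critical value $F[p, \nu_N]$.

Yao (1965)

The statistic

$$F_Y = \frac{\nu_K}{p f}\, T_2 \tag{A13}$$

is referred to the critical value $F[p, \nu_K]$, where $f$ is given by equation A10 and $\nu_K$ again equals $f - p + 1$.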
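As a rough illustration of how one of the Appendix procedures might be computed, the following Python sketch evaluates the Johansen (1980) statistic for two groups. It is not the SAS/IML program referred to in this article. The function names are hypothetical, the matrix helpers are written in plain Python only to keep the sketch self-contained, and the constants assume Johansen's general form with c = p + 2C - 6C/(p + 2) and error df p(p + 2)/(3C), where A_j = S_j/n_j and A = A_1 + A_2. The inputs may be the usual means and covariance matrices or their robust counterparts; a p-value is not computed because it requires an F distribution routine.

```python
def mat_inv(m):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


def johansen_test(mean1, mean2, cov1, cov2, n1, n2):
    """Return T2, F_J and its df for two groups (hypothetical helper)."""
    p = len(mean1)
    a1 = [[v / n1 for v in row] for row in cov1]      # A_1 = S_1 / n_1
    a2 = [[v / n2 for v in row] for row in cov2]      # A_2 = S_2 / n_2
    a = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    a_inv = mat_inv(a)
    d = [x - y for x, y in zip(mean1, mean2)]         # mean difference
    # Quadratic form d' A^{-1} d (no pooling of covariance matrices)
    t2 = sum(d[i] * a_inv[i][j] * d[j] for i in range(p) for j in range(p))
    c_big = 0.0
    for aj, nj in ((a1, n1), (a2, n2)):
        m = mat_mul(a_inv, aj)                        # A^{-1} A_j
        c_big += (trace(mat_mul(m, m)) + trace(m) ** 2) / (2 * (nj - 1))
    c = p + 2 * c_big - 6 * c_big / (p + 2)
    return {"T2": t2, "F": t2 / c, "df1": p,
            "df2": p * (p + 2) / (3 * c_big)}
```

With a single outcome variable (p = 1), c equals 1 and the error df reduces to the familiar Welch-Satterthwaite df for two groups with unequal variances, which provides a convenient check on the sketch.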

Footnotes

¹ The sum of the eigenvalues of a matrix is called the trace of a matrix.
² The skewness for the normal distribution is zero.

Table 1. Multivariate Example Data Set

Group    Subject    Y_i1    Y_i2

Table 2. Summary Statistics for Least-Squares and Robust Estimators

Least-Squares Estimators:    S_1    S_2
Robust Estimators:           Y_t1    Y_t2    S_w1    S_w2

Table 3. Hypothesis Test Results for Multivariate Example Data Set

Procedure   Test Statistic   df         p-value/critical value (CV)   Decision re: Null Hypothesis

Least-Squares Estimators
T²          T = 6.           ν =        p = .06                       Fail to Reject
            F_T = .8         ν =
BF          T_BF = 9.        ν = .4     p = .6                        Fail to Reject
            F_BF = 3.7       ν = 4.4
J2          T = 5.0          ν =        CV = 4.                       Fail to Reject
J           T = 5.0          ν =        p = .75                       Fail to Reject
            F_J = .3         ν = 6.9
K           F_K = .5         ν = .5     p = .64                       Fail to Reject
                             ν = 6.
NV          T = 5.0          ν =        p = .37                       Fail to Reject
            F_NV = .0        ν = 4.4
Y           T = 5.0          ν =        p = .98                       Fail to Reject
            F = .            ν = 6.

Robust Estimators
T²          T = 59.0         ν =        p = .00                       Reject
            F_T = 5.8        ν = 7
BF          T_BF = 3.        ν = .5     p = .00                       Reject
            F_BF = 56.       ν = 6.0
J2          T = 65.          ν =        CV = 3.3                      Reject
J           T = 65.          ν =        p = .00                       Reject
            F_J = 9.5        ν = 6.3
K           F_K = 8.         ν = .0     p = .00                       Reject
                             ν = 6.6
NV          T = 65.          ν =        p = .00                       Reject
            F_NV = 7.9       ν = 6.0
Y           T = 65.          ν =        p = .00                       Reject
            F = 8.3          ν = 6.6

Note. T² = Hotelling's (1931) T²; BF = Brown & Forsythe (1974); J2 = James (1954) second order; J = Johansen (1980); K = Kim (1992); NV = Nel & van der Merwe (1986); Y = Yao (1965).