Simple ANalysis Of Variance (ANOVA)

Oftentimes we have more than two groups that we want to compare. The purpose of ANOVA is to allow us to compare group means from several independent samples. In general, ANOVA procedures are generalizations of the t-test, and it can be shown that, if one is only interested in the difference between two groups on one independent categorical (i.e. grouping) variable, the independent samples t-test is a special case of ANOVA.

A one-way ANOVA refers to having only one grouping variable, or factor, which is the independent variable. It is possible to have more than one grouping variable, but we will start with the simplest case. If the grouping variable has only two levels, one can simply conduct an independent samples t-test, but if it has more than two levels, one needs to conduct an ANOVA.

Since we have more than two groups in ANOVA, we need a way to describe the differences among all the means. One way to do this is to compute the variance of the sample means: a large variance implies that the sample means differ a lot, whereas a small variance implies that the sample means are not that different. This gives us a single numeric value for the differences among all the sample means.

The statistic used in ANOVA partitions the variance into two components: (1) the between-treatment variability (see footnote 1) and (2) the within-treatment variability. Whenever means from different samples are compared, there are three sources that can cause differences to be observed between the sample means:

1. Differences due to Treatment
2. Individual Differences
3. Differences due to Experimental Error

These three sources of variability can cause one to observe differences between treatment groups, and so they are referred to as the between-treatment variability.
Only two of these sources of variability can be observed within a treatment group, specifically individual differences and experimental error, and these are referred to as the within-treatment variability (see footnote 2). The statistic used in ANOVA, the F statistic, uses the ratio of between-treatment variability to within-treatment variability to test whether or not there is a difference among treatments. Specifically:

$$F = \frac{\text{between-treatment variability}}{\text{within-treatment variability}} = \frac{\text{treatment effect} + \text{individual differences} + \text{experimental error}}{\text{individual differences} + \text{experimental error}}$$

Footnote 1: Note that groups do not always represent treatments. Oftentimes ANOVA is used to determine differences in intact groups such as those that differ by ethnicity or gender.

Footnote 2: It should be noted that your book, and many statistical software packages, refer to the within-treatment variability as the error variability.

If the treatment effect is small, then the ratio will be close to one. Therefore, an F-statistic close to one would be expected if the null hypothesis were true and there were no treatment differences. If the treatment effect is large, then the ratio will be much greater than one, because the between-treatment variability will be much larger than the within-treatment variability.

The hypotheses tested in ANOVA are:

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_K$
$H_1$: at least one mean is different from the rest

where K = the total number of groups or sample means being compared.

In the population, group j has mean $\mu_j$ and variance $\sigma_j^2$. In the sample, group j has mean $\bar{X}_j$ and variance $s_j^2$. The sample size for each group is $n_j$, and the total number of observations is $N = n_1 + n_2 + \dots + n_K$. The grand mean of all observations is $\bar{X}$.

The assumptions underlying the test are the same as the assumptions underlying the t-test for independent samples. Specifically,

1. Each group in the population is normally distributed with mean $\mu_j$.
2. The variance in each group is the same, so that $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_K^2 = \sigma^2$, otherwise known as the homogeneity of variance assumption.
3. Each observation is independent of every other observation.

The computations underlying a simple one-way ANOVA are pretty straightforward if you remember that a variance is composed of two parts: (1) the sum of squared deviations from the mean (SS) and (2) the degrees of freedom (df), which can be thought of as the number of potentially different values that are used to compute the SS, minus 1. Therefore the total variance, across all groups, is computed using $SS_{total} = \sum_j \sum_i (X_{ij} - \bar{X})^2$ and $df_{total} = N - 1$. We partition this variance into two parts, the within-treatment variance and the between-treatment variance. Note that it is the sums of squares and the degrees of freedom that add up, not the variances themselves: $SS_{total} = SS_{within} + SS_{between}$ and $df_{total} = df_{within} + df_{between}$.
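The partition of the total sum of squares and of the degrees of freedom can be checked numerically. The following is a minimal sketch in Python; the ratings are made up for illustration (they are not the data from these notes), and the variable names are assumptions of this sketch.

```python
# Hypothetical ratings for three groups (illustrative only).
groups = {
    "A": [5, 4, 6, 5, 3],
    "B": [7, 8, 6, 9],
    "C": [2, 3, 1, 2, 4, 3],
}

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

all_obs = [x for obs in groups.values() for x in obs]
grand_mean = mean(all_obs)
N, K = len(all_obs), len(groups)

# Total: squared deviations of every observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_obs)
# Within: squared deviations of every observation from its own group mean.
ss_within = sum((x - mean(obs)) ** 2 for obs in groups.values() for x in obs)
# Between: squared deviations of group means from the grand mean, weighted by n_j.
ss_between = sum(len(obs) * (mean(obs) - grand_mean) ** 2 for obs in groups.values())

df_total, df_within, df_between = N - 1, N - K, K - 1

print("SS total:", round(ss_total, 4))
print("SS within + SS between:", round(ss_within + ss_between, 4))
print("df total:", df_total, "= df within + df between:", df_within + df_between)
```

The two printed sums of squares agree up to floating-point rounding, and the degrees of freedom add in the same way.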
The within-treatment or within-group variance is computed using $SS_{within} = SS_{error} = \sum_j \sum_i (X_{ij} - \bar{X}_j)^2$, which represents the sum of the squared deviations of each observation from its group mean, and $df_{within} = df_{error} = (n_1 - 1) + (n_2 - 1) + (n_3 - 1) + \dots + (n_K - 1)$ = (total number of observations) minus (number of groups) = $N - K$. The ratio of $SS_{within}$ to $df_{within}$ is known as the Mean Square within groups ($MS_{within}$) or Mean Square Error ($MS_{error}$).

The between-treatment variability is computed using $SS_{between} = SS_{treatment} = \sum_j n_j (\bar{X}_j - \bar{X})^2$, which represents the sum of the squared deviations of all group means from the grand (overall) mean, and $df_{between} = df_{treatment} = K - 1$, or the number of groups minus one. The ratio of $SS_{between}$ to $df_{between}$ is known as the Mean Square between groups ($MS_{between}$) or Mean Square Treatment ($MS_{treatment}$).

The F-statistic is calculated by computing the ratio of Mean Square between groups ($MS_{between}$ or $MS_{treatment}$) to Mean Square within groups ($MS_{within}$ or $MS_{error}$). Specifically,

$$F = \frac{MS_{between}}{MS_{within}}$$

This ratio follows a sampling distribution known as the F distribution, which is a family of distributions based on the df of the numerator and the df of the denominator.

Example

A psychologist is interested in determining the extent to which physical attractiveness may influence a person's judgment of other personal characteristics, such as intelligence or ability. So he selects three groups of subjects and asks them to pretend to be a company personnel manager, and he gives them all a stack of identical job applications which include pictures of the applicants. One group of subjects is given only pictures of very attractive people, another group is given only pictures of average looking people, and a third group is given only pictures of unattractive people. Subjects are asked to rate the quality of each applicant on a scale of 0 (which represents very poor qualities) to 10 (which represents excellent qualities). The following data are obtained:

[Data table with columns Attractive, Average, Unattractive; the ratings were not preserved in this transcription.]

What should he conclude? Well, we first need to calculate the grand mean $\bar{X}$ and the means $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$ for each of the three groups.
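The group means, mean squares, and F ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `one_way_anova` and the `ratings` data are hypothetical, and the data are not the ratings from the psychologist's study.

```python
def one_way_anova(groups):
    """Compute the one-way ANOVA quantities for a list of groups (lists of numbers)."""
    all_obs = [x for g in groups for x in g]
    n_total, k = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: n_j * (group mean - grand mean)^2, summed over groups.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within groups: (observation - its group mean)^2, summed over all observations.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "group_means": group_means,
        "grand_mean": grand_mean,
        "df": (df_between, df_within),
        "MS_between": ms_between,
        "MS_within": ms_within,
        "F": ms_between / ms_within,
    }

# Hypothetical ratings on the 0-10 scale for three groups of raters.
ratings = [[5, 4, 6, 5, 3], [7, 8, 6, 9], [2, 3, 1, 2, 4, 3]]
result = one_way_anova(ratings)
print(result["df"], round(result["F"], 2))
```

The resulting F is then compared with the critical value of the F distribution for the two df values, taken from a table or a statistics package.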

Now we can calculate (see footnote 3)

$$MS_{within} = \frac{\sum_j \sum_i (X_{ij} - \bar{X}_j)^2}{N - K} = \frac{(5 - 4.55)^2 + (4 - 4.55)^2 + \dots + (6 - 5.9)^2 + (5 - 5.9)^2}{34 - 3} = 1.94$$

and

$$MS_{between} = \frac{\sum_j n_j (\bar{X}_j - \bar{X})^2}{K - 1} = 36.63$$

So the F-statistic = 36.63/1.94, or approximately 18.9, but how likely is it to have obtained this value if the null hypothesis is true? With 2 and 31 df the critical F, at α = .05, is approximately 3.3. So the psychologist can reject the null hypothesis and conclude that a person's judgment of the job qualifications of prospective applicants appears to be influenced by how attractive the prospective applicant is.

The ANOVA procedure is robust to violations of the assumptions, especially the assumption of normality. Violating the assumption of homogeneity of variance is especially problematic if the groups consist of different sample sizes. Levene's test, which we talked about before in terms of the t-test, can be used to test whether the homogeneity of variance assumption has been violated. If it has, then the Welch procedure can be used to adjust the df used in ANOVA, similar to what we talked about for the t-test.

If the normality assumption is violated, then the data can be transformed to be more normally distributed (a transformation of this kind re-scales the observations while preserving their order, and the test is then carried out on the transformed scale). Common transformations include:

1. Taking the square root of each observation is beneficial if the data are very skewed.
2. Taking the log of each observation is beneficial if the data are very positively skewed.
3. Taking the reciprocal of each observation (i.e. 1/observation) is beneficial if there are very large values in the positive tail of the distribution.

Another approach to dealing with a violation of the normality assumption is to use a trimmed sample, which removes a fixed percentage of the extreme values in each of the tails of the distribution, or a Winsorized sample, which replaces the values that are trimmed with the most extreme observations that are left in each tail. In the latter case the df need to be adjusted by the number of values that are replaced.
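The transformations and the trimmed and Winsorized samples described above can be illustrated with a short Python sketch. The function names, the example scores, and the convention of cutting `int(n * proportion)` values from each tail are assumptions made for this illustration, not a prescribed procedure.

```python
import math

def transform(values, kind):
    """Apply one of the three transformations to each observation."""
    if kind == "sqrt":
        return [math.sqrt(x) for x in values]   # requires x >= 0
    if kind == "log":
        return [math.log(x) for x in values]    # requires x > 0
    if kind == "reciprocal":
        return [1 / x for x in values]          # requires x != 0
    raise ValueError(f"unknown transformation: {kind}")

def tail_count(n, proportion):
    """Number of values cut from EACH tail (assumed convention: round down)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    return int(n * proportion)

def trim(values, proportion):
    """Drop the given proportion of values from each tail."""
    ordered = sorted(values)
    k = tail_count(len(ordered), proportion)
    return ordered[k:len(ordered) - k]

def winsorize(values, proportion):
    """Replace the trimmed values with the most extreme values that remain."""
    kept = trim(values, proportion)
    k = tail_count(len(values), proportion)
    return [kept[0]] * k + kept + [kept[-1]] * k

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # hypothetical, with one extreme value
print(trim(scores, 0.10))        # the smallest and largest score are removed
print(winsorize(scores, 0.10))   # they are replaced by 2 and 9 instead
```

With a Winsorized sample, the number of replaced values (here two) is the amount by which the df would be adjusted.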
As we explore more complicated ANOVA models (models with more than one grouping variable), it will become important to be able to differentiate between fixed factors (or groups) and random factors.

Footnote 3: Answers obtained by hand, from Excel, or from a statistical software package will all most likely vary slightly due to rounding error.

A fixed factor is one in which the researcher is only interested in the particular levels of the grouping variable that are being studied. These levels are not assumed to be representative of, nor generalizable to, other levels of the group. A random factor is one in which the researcher considers the various levels of the grouping variable to be a random sample from all possible levels. In this situation the results of the statistical test may be generalized to other levels of the group.

It should be noted that there is a direct relationship between the t-test for independent samples and the ANOVA when K = 2. Specifically, it can be shown mathematically that the F-statistic equals the t-statistic squared (i.e. $F = t^2$).

Power and Effect Size

Similar to the t-test, finding statistical significance does not tell us whether the differences are important from a practical perspective. Several measures of effect size have been proposed, all of which differ in terms of how biased they are.

$\eta^2$ (eta-squared), or the correlation ratio, is one of the oldest measures of effect size. It represents the percentage of total variability that can be accounted for by differences in the grouping variable, or the percentage by which the error variability (i.e. within-treatment variability) is reduced by considering group membership. This is done by calculating the ratio of $SS_{between}$ to $SS_{total}$. Specifically:

$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$

For our previous example, $SS_{between} = 73.25$ and $\eta^2 = .55$, meaning 55% of the variation in ratings can be accounted for by differences in the independent variable (i.e. the groups). This effect size measure is biased upwards, meaning it is larger than would be expected if it were calculated from the population rather than estimated from the sample. An alternative effect size measure to $\eta^2$ is $\omega^2$ (omega-squared).
It also measures the percentage of total variability that can be accounted for by between-group variability, but does so by using MS values, rather than SS values alone, thereby making use of sample size information. Specifically, for a fixed effect (see footnote 4) ANOVA:

$$\omega^2 = \frac{SS_{between} - (K - 1) MS_{within}}{SS_{total} + MS_{within}}$$

For our previous example, $\omega^2 = \frac{73.25 - (3 - 1)(1.94)}{SS_{total} + 1.94}$. This measure of effect size has been found to be less biased than $\eta^2$. Note that it is smaller than what we obtained for $\eta^2$.

Footnote 4: Note that this measure of effect size is computed slightly differently for a random effects ANOVA model, and that the formula for a random effects ANOVA model is not presented here.
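Both effect size measures can be computed from the summary quantities of the worked example. The sketch below is in Python and rests on assumptions that should be checked against the original handout: it reads $SS_{between}$ as 73.25 and $MS_{within}$ as 1.94 with N = 34 and K = 3, and it reconstructs $SS_{total}$ from the partition $SS_{total} = SS_{between} + SS_{within}$ rather than taking it from the notes. The function names are hypothetical.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of total variability accounted for by group membership."""
    return ss_between / ss_total

def omega_squared(ss_between, ss_total, ms_within, k):
    """Less biased alternative to eta-squared for a fixed effect one-way ANOVA."""
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Summary values as read from the worked example (assumed, see the note above).
ss_between = 73.25
ms_within = 1.94
n_total, k = 34, 3

# SS_within = MS_within * df_within, and SS_total follows from the partition.
ss_within = ms_within * (n_total - k)
ss_total = ss_between + ss_within

eta_sq = eta_squared(ss_between, ss_total)
omega_sq = omega_squared(ss_between, ss_total, ms_within, k)
print(round(eta_sq, 2), round(omega_sq, 2))
```

Under these assumptions the eta-squared value rounds to the .55 reported in the text, and omega-squared comes out somewhat smaller, as expected for the less biased measure.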

Estimating power for ANOVA is a straightforward extension of how power was estimated for the t-test. We simply use different notation and different tables. Moreover, we assume equal sample sizes in each group, which is the optimal situation.

In an ANOVA context, $\phi'$ is comparable to d in the independent t-test context, and separates out the effect size from the sample size. However, we need to incorporate the fact that we are using variance estimates in the ANOVA context. Specifically,

$$\phi' = \sqrt{\frac{\sum_j (\mu_j - \mu)^2 / K}{\sigma_e^2}}$$

So, if we were to assume that the population values correspond exactly to what we obtained in our example (unlikely as this may be), then $\phi'$ is obtained by substituting the three group means, the grand mean, and $MS_{error}$ (as the estimate of $\sigma_e^2$) into this formula.

Furthermore, in an ANOVA context, $\phi$ is comparable to δ in the independent t-test context, in that it incorporates sample size to allow us to determine how large a sample we need to detect meaningful differences, from a practical perspective. However, even though we may wind up with unequal sample sizes in our groups, we calculate power based on the assumption of equal sample sizes. Specifically,

$$\phi = \phi' \sqrt{n}$$

where n = the number of subjects in each group. So, if we were to assume that we expected 12 subjects in each of our groups in our example, then $\phi = \phi' \sqrt{12}$.

In an ANOVA context we can use $\phi$ together with the non-central F distribution, which describes the distribution of F when the null hypothesis is false, with K − 1 and N − K df for the numerator and denominator, respectively. For our example, we will use an estimate corresponding to $\phi$ = 3.0, because our table in the book does not go any higher, and we will look it up with 2 df for the numerator and 30 df for the denominator (because our book does not have very fine gradations for df in the denominator). Using the table in the book we find that β = .03 if we want to conduct our test at α = .01. Therefore, since Power = 1 − β, the power of the experiment we ran was approximately .97.
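The two quantities used to enter the power table can be sketched in Python as follows. This assumes the usual textbook definitions, namely that $\phi'$ is the square root of the average squared deviation of the population means from their grand mean divided by the error variance, and that $\phi = \phi'\sqrt{n}$ with equal group sizes. The function names and the population values below are hypothetical and are not the values from the worked example.

```python
import math

def phi_prime(pop_means, error_variance):
    """Effect size: sqrt( [sum of (mu_j - mu)^2 / K] / sigma_e^2 ), equal n assumed."""
    k = len(pop_means)
    grand_mean = sum(pop_means) / k   # unweighted mean, since group sizes are equal
    mean_sq_dev = sum((m - grand_mean) ** 2 for m in pop_means) / k
    return math.sqrt(mean_sq_dev / error_variance)

def phi(phi_p, n_per_group):
    """Combine the effect size with the per-group sample size."""
    return phi_p * math.sqrt(n_per_group)

# Hypothetical population means and error variance (illustrative only).
means = [2.0, 4.0, 6.0]
sigma_e_sq = 4.0

effect = phi_prime(means, sigma_e_sq)
print(round(effect, 3))
print(round(phi(effect, 12), 3))   # value used to enter the power table
```

The resulting $\phi$ is then looked up in a non-central F table, or passed to software, along with the numerator df, the denominator df, and the chosen α.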