The Analysis of Variance ANOVA

Transcription

1 The Analysis of Variance (ANOVA)
Lecture / Dr. P.'s Clinic Consultant Module in Probability & Statistics in Engineering

2 Today in P&S
Analysis of Variance (ANOVA)
- Definitions
- Single Factor ANOVA
- Setting and assumptions
- The F-statistic
- Tests about the variance of two populations
- F-distribution and F-test
- ANOVA variables and ANOVA table
- ANOVA using MATLAB
- Multiple Comparisons in ANOVA
2006 All Rights Reserved, Robi Polikar, Rowan University, Dept. of Electrical and Computer Engineering

3 Definitions
The analysis of variance (ANOVA) refers to a collection of experimental situations and statistical procedures for the analysis of quantitative responses from experimental units.
The simplest form is known as single factor ANOVA or one-way ANOVA, and is usually used for comparing the means of
- data sampled from more than two populations, or
- data from experiments in which more than two treatments have been used.
The characteristic that differentiates the treatments or populations from one another is called the factor under study, and the different treatments or populations are referred to as the levels of the factor.

4 Examples
- An experiment to study the role of geographic demographics (e.g., urban, suburban, rural, international urban, international rural) in overall student success. The factor of interest is the geographic demographic, and there are five different qualitative levels.
- An experiment to study the effect of different diets (Mediterranean, Middle Eastern, Southern US, Chinese, Atkins, Vegetarian, Low Carb) on cancer rates. The factor is the diet, with seven different qualitative levels.
- An experiment to study the effect of precise temperature on bacteria growth rate. The factor is the temperature, and the levels are quantitative in nature [0ºC ~ 0ºC].
- An experiment to study the chip defect rate of different VLSI technologies (0.0 micron, 0.05 micron, 0.08 micron, 0. micron). The factor is the size of the single component (transistor) on the chip, with four quantitative levels.

5 Single Factor ANOVA: Definitions and Assumptions
In all of the above examples there is one factor with multiple levels, hence a one-way (single factor) analysis of multiple populations.
Some definitions and assumptions:
- $I$: number of populations or treatments being compared
- $J_i$: sample size for the $i$th population/treatment. Often $J_i = J$ for $i = 1, 2, \dots, I$, giving $IJ$ observations in total
- $\mu_i$: the mean value of the $i$th population, or the average response when the $i$th treatment is applied
- $X_{ij}$: the random variable that denotes the $j$th measurement taken from the $i$th population
- $x_{ij}$: the observed value of $X_{ij}$
- $\bar{X}_{i\cdot}$: sample mean of the $i$th population, computed over all $J_i$ values
- $\bar{X}_{\cdot\cdot}$: grand mean, the average of all $IJ$ observations
- $S_i^2$: sample variance of the $i$th population
$$\bar{X}_{i\cdot} = \frac{1}{J}\sum_{j=1}^{J} X_{ij}, \qquad \bar{X}_{\cdot\cdot} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} X_{ij}, \qquad S_i^2 = \frac{1}{J-1}\sum_{j=1}^{J}\left(X_{ij} - \bar{X}_{i\cdot}\right)^2, \quad i = 1, 2, \dots, I$$
Assumption: all $I$ distributions are normal with the same variance $\sigma^2$. That is, each $X_{ij}$ is normally distributed with $E(X_{ij}) = \mu_i$ and $Var(X_{ij}) = \sigma^2$. We will accept this assumption as satisfied as long as $\max(\sigma_i) < 2\,\min(\sigma_i)$.
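The quantities defined on this slide can be computed directly. The following is a minimal Python sketch, not part of the lecture (which uses MATLAB); the data values and variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev, variance

# Hypothetical data (not from the lecture): I = 3 treatments, J = 4 observations each.
samples = [
    [10.0, 12.0, 11.0, 13.0],
    [14.0, 15.0, 13.0, 16.0],
    [9.0, 8.0, 10.0, 11.0],
]

sample_means = [mean(s) for s in samples]         # x-bar_i. for each treatment
grand_mean = mean(x for s in samples for x in s)  # x-bar_.. over all I*J values
sample_vars = [variance(s) for s in samples]      # s_i^2, divisor J - 1
sample_sds = [stdev(s) for s in samples]          # s_i

# Rule of thumb for the equal-variance assumption:
# the largest sample standard deviation is below twice the smallest.
equal_variance_ok = max(sample_sds) < 2 * min(sample_sds)

print(sample_means, grand_mean, sample_vars, equal_variance_ok)
```

The check uses the sample standard deviations $s_i$ as stand-ins for the unknown $\sigma_i$, which is an assumption of this sketch.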

6 Single Factor Experiments
A typical dataset can be summarized as follows:
Treatment   Observations                              Totals           Averages
1           $x_{11}$  $x_{12}$  ...  $x_{1J}$         $x_{1\cdot}$     $\bar{x}_{1\cdot}$
2           $x_{21}$  $x_{22}$  ...  $x_{2J}$         $x_{2\cdot}$     $\bar{x}_{2\cdot}$
...
I           $x_{I1}$  $x_{I2}$  ...  $x_{IJ}$         $x_{I\cdot}$     $\bar{x}_{I\cdot}$
If we replace each observation with the mean of its treatment, the difference between the observed value and that mean is called the residual:
$$e_{ij} = x_{ij} - \bar{x}_{i\cdot}$$
The residuals are expected to have a normal distribution, which can be checked using a normality plot.

7 Example: In-Class Exercise
The following data show the number of hours students of different colleges spend on homework. Each row holds the hours reported by one randomly selected student from each college (6 students per college), and each column is a college: ENG, LAS, COM, EDU, FPA.
[Data table not recovered.]
Do students from certain colleges study harder?
I = number of populations = 5
J = sample size of each population = 6
$$\bar{x}_{i\cdot} = \frac{1}{J}\sum_{j=1}^{J} x_{ij}, \qquad \bar{x}_{\cdot\cdot} = \frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij} = \frac{455}{30} \approx 15.17$$
[Per-college sample means not recovered.]

8 Hypothesis testing for ANOVA
The hypothesis of interest in one-way ANOVA is
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_I \quad \text{vs.} \quad H_a: \text{at least two of the means are different}$$
If $H_0$ is true, then $\mu_1 = \mu_2 = \dots = \mu_I$, and therefore $\bar{x}_{1\cdot}, \bar{x}_{2\cdot}, \dots, \bar{x}_{I\cdot}$ should all be reasonably close to each other.
The procedure to test this hypothesis is based on comparing a measure of between-samples variation to a measure of within-sample variation.
- Within-sample variation is the variation within each sample (each population). This variation is independent of whether $H_0$ is true or false: it is the inherent variation within each sample, hence an indicator of noise / error within each sample.
- Between-samples variation, however, can indicate whether $H_0$ is true or false. The variation from one sample mean to another will be large only if the population means are truly different, an indication that $H_0$ is false.
Therefore, the ratio of the two gives an even stronger indication of whether $H_0$ is true: if the between-samples variation is large, particularly when the within-sample variation (noise) is small, then we have even more evidence against $H_0$.

9 Within / Between?
[Figure: the homework data plotted by college (ENG, LAS, COM, EDU, FPA). The variation within each sample, averaged over the samples, is the within-sample variation. The average variation among the sample means $\bar{x}_{i\cdot}$ is the between-sample variation.]

10 Test-Statistic
The between-samples variation and the within-sample variation can be expressed quantitatively using the mean square for treatments (MSTr) and the mean square error (MSE), respectively.
$$MSTr = \frac{J}{I-1}\left[(\bar{X}_{1\cdot} - \bar{X}_{\cdot\cdot})^2 + \dots + (\bar{X}_{I\cdot} - \bar{X}_{\cdot\cdot})^2\right] = \frac{J}{I-1}\sum_{i=1}^{I}(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$$
MSTr collects the variations between each sample mean and the overall mean, hence it is a measure of between-samples variation.
$$MSE = \frac{S_1^2 + S_2^2 + \dots + S_I^2}{I}$$
Each sample variance measures the variation (noise) within that sample. The average of all sample variances is then the average within-sample variation, the mean square error.
The test statistic for one-way ANOVA is then
$$F = \frac{MSTr}{MSE}$$
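The two mean squares and their ratio can be sketched in a few lines. This is an illustrative Python version under the slide's equal-sample-size setting; the function name `one_way_f` and the data are hypothetical, not from the lecture.

```python
from statistics import mean, variance

def one_way_f(samples):
    """Return (MSTr, MSE, F) for I samples that all have the same size J."""
    I = len(samples)
    J = len(samples[0])
    if any(len(s) != J for s in samples):
        raise ValueError("this sketch assumes equal sample sizes")
    means = [mean(s) for s in samples]
    grand = mean(means)  # with equal J the grand mean is the mean of the sample means
    mstr = J / (I - 1) * sum((m - grand) ** 2 for m in means)  # between-samples
    mse = mean(variance(s) for s in samples)                   # within-sample
    return mstr, mse, mstr / mse

# Hypothetical data: I = 3 treatments, J = 4 observations each.
samples = [
    [10.0, 12.0, 11.0, 13.0],
    [14.0, 15.0, 13.0, 16.0],
    [9.0, 8.0, 10.0, 11.0],
]
mstr, mse, f = one_way_f(samples)
print(mstr, mse, f)
```

For unequal sample sizes the formulas change (the grand mean is no longer the plain average of the sample means), which is why the sketch rejects that case rather than guessing.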

11 Test-statistic
What value of F provides information regarding rejecting $H_0$?
Recall that if $H_0$ is true, then $\mu_1 = \mu_2 = \dots = \mu_I$, and therefore $\bar{x}_{1\cdot}, \bar{x}_{2\cdot}, \dots, \bar{x}_{I\cdot}$ should be reasonably close to each other, and also to the grand mean $\bar{x}_{\cdot\cdot}$. Then the differences between the individual sample means and the grand mean would be small, resulting in a small MSTr. Otherwise, the differences would be large, resulting in a large MSTr.
MSE, however, is independent of whether $H_0$ is true, as it depends only on the sample variances. Therefore, we can assert that:
- When $H_0$ is true, $E(MSTr) = E(MSE) = \sigma^2$
- When $H_0$ is false, $E(MSTr) > E(MSE) = \sigma^2$
Therefore, an F value $\gg 1$, indicating that $MSTr \gg MSE$, provides justifiable skepticism about $H_0$. The form of the rejection region is therefore $f \ge c$, where $f$ is the observed value of the F statistic and $c$ is the cutoff chosen to give enough benefit of the doubt to $H_0$. That is, $c$ is chosen such that $P(F \ge c \text{ when } H_0 \text{ is indeed true}) = \alpha$, the desired significance level.
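The claim that MSTr and MSE both estimate $\sigma^2$ when $H_0$ is true can be checked by simulation. The sketch below is an illustration in Python rather than anything from the lecture; the function name, the seed, and the parameter values ($I = 5$, $J = 6$, $\sigma = 2$) are assumptions chosen for the demonstration.

```python
import random
from statistics import mean, variance

def simulate_mean_squares(I, J, sigma, trials, seed=0):
    """Average MSTr and MSE over many datasets drawn with H0 true (all means equal)."""
    rng = random.Random(seed)
    mstr_vals, mse_vals = [], []
    for _ in range(trials):
        # Every treatment has the same mean (0) and the same sigma, so H0 holds.
        samples = [[rng.gauss(0.0, sigma) for _ in range(J)] for _ in range(I)]
        means = [mean(s) for s in samples]
        grand = mean(means)
        mstr_vals.append(J / (I - 1) * sum((m - grand) ** 2 for m in means))
        mse_vals.append(mean(variance(s) for s in samples))
    return mean(mstr_vals), mean(mse_vals)

avg_mstr, avg_mse = simulate_mean_squares(I=5, J=6, sigma=2.0, trials=2000)
print(avg_mstr, avg_mse)  # both should be near sigma**2 = 4
```

Shifting the mean of one treatment away from zero in the sketch would leave the average MSE near $\sigma^2$ while inflating the average MSTr, which is the behavior the rejection region exploits.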

12 $\chi^2$-Distribution (Little side step)
The F-distribution is related to the Chi-squared ($\chi^2$) distribution.
Let $X_1, \dots, X_n$ be a random sample from a normal distribution with parameters $\mu$ and $\sigma^2$. Then the following random variable has a $\chi^2$-distribution with $\nu = n - 1$ degrees of freedom:
$$\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2}$$
The $\chi^2$-distribution is used in computing confidence intervals for the variance (as opposed to the z- or t-distribution used for the confidence interval of the mean).
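As a worked consequence of this result, the confidence interval for the variance follows by inverting the probability statement. This is the standard derivation under the slide's normality assumption; the notation $\chi^2_{\alpha,\nu}$ for the value with upper-tail area $\alpha$ is an assumption made here, not taken from the slide.

```latex
P\left(\chi^2_{1-\alpha/2,\,n-1} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2,\,n-1}\right) = 1-\alpha
\quad\Longrightarrow\quad
\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \;\le\; \sigma^2 \;\le\; \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}
```

Because the $\chi^2$ density is not symmetric, the interval is not centered on $s^2$.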

13 In MATLAB
MATLAB has several functions to compute various parameters of the $\chi^2$ distribution:
- Y = chi2pdf(X,V) computes the $\chi^2$ pdf at each of the values in X using the corresponding parameters in V. (V can be a vector including several df's, in which case MATLAB will compute the pdf for each df.)
- P = chi2cdf(X,V) computes the $\chi^2$ cdf at each of the values in X using the corresponding parameters in V.
- X = chi2inv(P,V) computes the inverse of the $\chi^2$ cdf with parameters specified by V for the corresponding probabilities in P. That is, given an area under the curve, this function computes the corresponding critical value, to the left of which the area is the specified value P (1 - alpha).
- [M,V] = chi2stat(NU) returns the mean and variance for the $\chi^2$ distribution with degrees of freedom parameters specified by NU.
- R = chi2rnd(V) generates random numbers from the $\chi^2$ distribution with degrees of freedom parameters specified by V.

14 F-distribution
The F probability distribution has two parameters, $\nu_1$ (number of numerator degrees of freedom) and $\nu_2$ (number of denominator degrees of freedom).
If $X_1$ and $X_2$ are independent $\chi^2$ rv's with $\nu_1$ and $\nu_2$ df, then the following ratio has an F-distribution with their respective df's:
$$F = \frac{X_1/\nu_1}{X_2/\nu_2}$$
Both the $\chi^2$ and F distributions are non-symmetric. However, the F-distribution has the interesting property that
$$F_{1-\alpha,\,\nu_1,\,\nu_2} = \frac{1}{F_{\alpha,\,\nu_2,\,\nu_1}}$$
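The reciprocal property can be derived in two lines. The sketch below assumes the upper-tail convention $P(F > F_{\alpha,\nu_1,\nu_2}) = \alpha$ for critical values and uses the fact that if $F$ has an F-distribution with $(\nu_1, \nu_2)$ df, then $1/F$ has an F-distribution with $(\nu_2, \nu_1)$ df, since the ratio is simply inverted.

```latex
P\left(F > F_{1-\alpha,\,\nu_1,\,\nu_2}\right) = 1-\alpha
\;\Longrightarrow\;
P\left(\frac{1}{F} \ge \frac{1}{F_{1-\alpha,\,\nu_1,\,\nu_2}}\right) = \alpha
\;\Longrightarrow\;
\frac{1}{F_{1-\alpha,\,\nu_1,\,\nu_2}} = F_{\alpha,\,\nu_2,\,\nu_1}
```

This is why F tables only need to list upper-tail critical values: lower-tail values follow by swapping the degrees of freedom and taking the reciprocal.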

15 In MATLAB
MATLAB has several functions that compute various parameters of the F-distribution:
- Y = fpdf(X,V1,V2) computes the F-distribution pdf at each of the values in X using the corresponding parameters in V1 and V2.
- P = fcdf(X,V1,V2) computes the F-distribution cdf at each of the values in X using the corresponding parameters in V1 and V2.
- X = finv(P,V1,V2) computes the inverse of the F-distribution cdf with numerator degrees of freedom V1 and denominator degrees of freedom V2 for the corresponding probabilities in P. That is, given an area under the curve, this function computes the corresponding critical value, to the left of which the area is the specified value P (1 - alpha).
- [M,V] = fstat(V1,V2) returns the mean and variance for the F distribution with parameters specified by V1 and V2.
- R = frnd(V1,V2) generates random numbers from the F distribution with numerator degrees of freedom V1 and denominator degrees of freedom V2.

16 F-Test for Equality of Variances (By request)
Let $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ be independent random samples from normal distributions with standard deviations $\sigma_1$ and $\sigma_2$. If $S_1$ and $S_2$ are the sample standard deviations, then the following random variable has an F-distribution with $\nu_1 = m - 1$ and $\nu_2 = n - 1$:
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$$
Under the hypothesis of equal variances, the observed value of the test statistic is the ratio of the two sample variances:
$$f = \frac{s_1^2}{s_2^2}$$
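A minimal Python sketch of the variance-ratio statistic follows; the two samples are hypothetical and the variable names are chosen here for illustration. The observed `f` would then be compared with a critical value from an F table or, in MATLAB, from `finv(1 - alpha, nu1, nu2)`.

```python
from statistics import variance

# Hypothetical samples (not from the lecture).
x = [4.0, 7.0, 1.0, 8.0, 5.0]  # m = 5 observations
y = [5.0, 6.0, 4.0, 5.0]       # n = 4 observations

s1_sq = variance(x)  # sample variance of x, divisor m - 1
s2_sq = variance(y)  # sample variance of y, divisor n - 1

f = s1_sq / s2_sq    # observed test statistic under H0: sigma_1 = sigma_2
nu1 = len(x) - 1     # numerator degrees of freedom
nu2 = len(y) - 1     # denominator degrees of freedom

print(f, nu1, nu2)
```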

17 F-Test for Equality of Variances
However, you can use MATLAB's finv(.) function for any arbitrary $\alpha$, $\nu_1$ and $\nu_2$.

18 Back to ANOVA: Test-statistic (Reprise)
What value of F provides information regarding rejecting $H_0$?
Again, recall that if $H_0$ is true, then $\mu_1 = \mu_2 = \dots = \mu_I$, and therefore $\bar{x}_{1\cdot}, \bar{x}_{2\cdot}, \dots, \bar{x}_{I\cdot}$ should be reasonably close to each other, and also to the grand mean $\bar{x}_{\cdot\cdot}$. Then the differences between the individual sample means and the grand mean would be small, resulting in a small MSTr. Otherwise, the differences would be large, resulting in a large MSTr.
MSE, however, is independent of whether $H_0$ is true, as it depends only on the sample variances. Therefore, we can assert that:
- When $H_0$ is true, $E(MSTr) = E(MSE) = \sigma^2$
- When $H_0$ is false, $E(MSTr) > E(MSE) = \sigma^2$
Therefore, an F value $\gg 1$, indicating that $MSTr \gg MSE$, provides justifiable skepticism about $H_0$. The form of the rejection region is therefore $f \ge c$, where $f$ is the observed value of the F statistic and $c$ is the cutoff chosen to give enough benefit of the doubt to $H_0$. That is, $c$ is chosen such that $P(F \ge c \text{ when } H_0 \text{ is indeed true}) = \alpha$, the desired significance level.

19 The F-Test: Example
Let $F = MSTr/MSE$ be the statistic in a single-factor ANOVA problem involving $I$ populations or treatments with a random sample of $J$ observations from each one. When $H_0$ is true (and the basic assumptions hold), F has an F distribution with $\nu_1 = I - 1$ and $\nu_2 = I(J - 1)$. The rejection region is then $f \ge F_{\alpha,\,I-1,\,I(J-1)}$ for the significance level $\alpha$.
Parameters $I$, $J$, $\nu_1$, $\nu_2$?
$I = 4$, $J = 6$, $\nu_1 = 3$, $\nu_2 = 4(6 - 1) = 20$, $\alpha = 0.05$, $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
[Figure: F-distribution for $\nu_1 = 3$, $\nu_2 = 20$, with the critical value $F_{0.05,\,3,\,20} = 3.10$ marked.]
$$MSTr = \frac{J}{I-1}\sum_{i=1}^{4}(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2 = 42{,}455.86$$
[The four sample means and the grand mean used in MSTr were not recovered.]
$$MSE = \frac{1}{4}\left[(46.55)^2 + (40.34)^2 + (37.20)^2 + (39.87)^2\right] = 1691.92$$
$$f_{OBS} = \frac{MSTr}{MSE} = \frac{42{,}455.86}{1691.92} = 25.09 \ggg 3.10$$

20 Other Formulas for ANOVA
In practice, we compute the following related parameters to conduct an F-test, using the sums (instead of the averages) of the $x_{ij}$'s:
$$x_{i\cdot} = \sum_{j=1}^{J} x_{ij}, \qquad x_{\cdot\cdot} = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}$$
Treatment sum of squares: the amount of variation that can be attributed to differences in the means of the samples.
$$SSTr = \frac{1}{J}\sum_{i=1}^{I} x_{i\cdot}^2 - \frac{x_{\cdot\cdot}^2}{IJ}$$
Error sum of squares: the amount of variation due to inherent noise in each sample, that is, the variation of each $x_{ij}$ from its sample mean.
$$SSE = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{ij} - \bar{x}_{i\cdot}\right)^2$$
Total sum of squares: a measure of the total variation in the data, based on the difference between each measurement and the grand mean.
$$SST = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \frac{x_{\cdot\cdot}^2}{IJ} = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{ij} - \bar{x}_{\cdot\cdot}\right)^2$$
Fundamental identity: $SST = SSTr + SSE$
Thus, the total variation (SST) can be partitioned into two pieces: SSE is the variation within samples, present whether or not $H_0$ is true, and SSTr is the variation between the samples, which can only be explained by differences in the sample means.
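The computational formulas and the fundamental identity can be verified numerically. The following Python sketch assumes equal sample sizes, as on the slide; the function name `sums_of_squares` and the data are hypothetical.

```python
def sums_of_squares(samples):
    """Return (SSTr, SSE, SST) for I samples that all have the same size J."""
    I = len(samples)
    J = len(samples[0])
    totals = [sum(s) for s in samples]          # x_i. (treatment totals)
    grand_total = sum(totals)                   # x_.. (grand total)
    correction = grand_total ** 2 / (I * J)     # x_..^2 / (IJ)
    sstr = sum(t ** 2 for t in totals) / J - correction
    sst = sum(x ** 2 for s in samples for x in s) - correction
    # SSE computed from its definition, so the identity below is a real check.
    sse = sum((x - t / J) ** 2 for s, t in zip(samples, totals) for x in s)
    return sstr, sse, sst

# Hypothetical data: I = 3 treatments, J = 4 observations each.
samples = [
    [10.0, 12.0, 11.0, 13.0],
    [14.0, 15.0, 13.0, 16.0],
    [9.0, 8.0, 10.0, 11.0],
]
sstr, sse, sst = sums_of_squares(samples)
print(sstr, sse, sst, abs(sst - (sstr + sse)) < 1e-9)
```

In hand calculations SSE is usually obtained as $SST - SSTr$; the sketch computes it independently so that the identity is verified rather than assumed.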

21 Sum Squares and Mean Squares
The statistics we compute, SSTr and SSE, are intimately related to MSTr and MSE:
$$MSTr = \frac{SSTr}{I-1}, \qquad MSE = \frac{SSE}{I(J-1)}, \qquad F = \frac{MSTr}{MSE} = \frac{SSTr/(I-1)}{SSE/\left(I(J-1)\right)}$$
F is the F r.v. with $\nu_1 = I - 1$ and $\nu_2 = I(J - 1)$.
The computations for the ANOVA test, using the F-test, are often summarized in a tabular form known as the ANOVA table.
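An ANOVA table can be assembled from the two sums of squares alone. The sketch below is one possible layout in Python, with a hypothetical function name and hypothetical input values; the row and column labels follow the usual source / df / SS / MS / F arrangement rather than any layout confirmed by the slides.

```python
def anova_table(sstr, sse, I, J):
    """Rows of (source, df, SS, MS, F) for a one-way ANOVA with equal sample sizes."""
    df_tr = I - 1          # numerator degrees of freedom
    df_err = I * (J - 1)   # denominator degrees of freedom
    mstr = sstr / df_tr
    mse = sse / df_err
    return [
        ("Treatments", df_tr, sstr, mstr, mstr / mse),
        ("Error", df_err, sse, mse, None),
        ("Total", I * J - 1, sstr + sse, None, None),
    ]

# Hypothetical sums of squares for I = 3 treatments with J = 4 observations each.
table = anova_table(sstr=152.0 / 3.0, sse=15.0, I=3, J=4)
for row in table:
    print(row)
```

Note that the treatment and error degrees of freedom add up to the total, $IJ - 1$, mirroring the identity $SST = SSTr + SSE$.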

23 In MATLAB
p = anova1(X) performs a one-way ANOVA for comparing the means of two or more columns of data in the m-by-n matrix X, where each column represents an independent sample containing m mutually independent observations. The function returns the p-value for the null hypothesis that all samples in X are drawn from the same population (or from different populations with the same mean). If the p-value is near zero, this casts doubt on the null hypothesis and suggests that at least one sample mean is significantly different from the other sample means.
The anova1 function displays two figures. The first figure is the standard ANOVA table, which divides the variability of the data in X into two parts:
- Variability due to the differences among the column means (variability between groups).
- Variability due to the differences between the data in each column and the column mean (variability within groups).
The second figure displays box plots of each column of X. Large differences in the center lines of the box plots correspond to large values of F and correspondingly small p-values.
The ANOVA test makes the following assumptions about the data in X:
- All sample populations are normally distributed.
- All sample populations have equal variance.
- All observations are mutually independent.
The ANOVA test is known to be robust to modest violations of the first two assumptions.
[p,table,stats] = anova1(...) returns the ANOVA table as a cell array, as well as a stats structure that you can use to perform a follow-up multiple comparison test.

24 Example: In-Class Exercise
The following data show the number of hours students of different colleges spend on homework. Each row holds the hours reported by one randomly selected student from each college (6 students per college), and each column is a college: ENG, LAS, COM, EDU, FPA.
[Data table not recovered.]
Do students from certain colleges study harder?
$H_0: \mu_1 = \mu_2 = \dots = \mu_5$
I = number of populations = 5
J = sample size of each population = 6

25 Example
With the column totals $x_{i\cdot} = \sum_{j=1}^{J} x_{ij}$ and the grand total $x_{\cdot\cdot} = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij} = 455$:
$$SSTr = \frac{1}{J}\sum_{i=1}^{I} x_{i\cdot}^2 - \frac{x_{\cdot\cdot}^2}{IJ} = \frac{1}{6}\left[x_{1\cdot}^2 + \dots + x_{5\cdot}^2\right] - \frac{(455)^2}{30}$$
$$SSE = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{ij} - \bar{x}_{i\cdot}\right)^2$$
$$SST = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \frac{x_{\cdot\cdot}^2}{IJ} = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}^2 - \frac{(455)^2}{30}, \qquad SST = SSTr + SSE$$
$$f_{OBS} = \frac{MSTr}{MSE} = \frac{SSTr/(I-1)}{SSE/\left(I(J-1)\right)} = 9.0077$$
[Numerical values of the column totals and of the individual sums of squares not recovered.]
Since $F_{0.05,\,4,\,25} = 2.76$, we have $f_{obs} \gg F_{\alpha}$: we can reject the null hypothesis that students from all colleges work the same amount.
We can also look at the p-value: what is the probability that, if $H_0$ were true, we would observe an $f_{obs}$ as large as 9.0077? In MATLAB: 1 - fcdf(9.0077, 4, 25) ≈ 1.2e-004 !!!
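The tail probability can also be checked without MATLAB. When the numerator degrees of freedom $\nu_1$ is even, the regularized incomplete beta function that defines the F upper tail reduces to a finite sum, so plain Python suffices. The function below is an illustrative sketch with a hypothetical name; it is not a replacement for `fcdf` and handles only even $\nu_1$. The denominator degrees of freedom used here is $I(J-1) = 5 \cdot 5 = 25$.

```python
def f_upper_tail_even_nu1(f, nu1, nu2):
    """P(F > f) for an F(nu1, nu2) variable, valid only when nu1 is even.

    Uses P(F > f) = I_x(nu2/2, nu1/2) with x = nu2 / (nu2 + nu1 * f); for an
    integer second argument b the incomplete beta function is the finite sum
    x**a * sum_{k=0}^{b-1} [a(a+1)...(a+k-1) / k!] * (1 - x)**k.
    """
    if nu1 % 2 != 0:
        raise ValueError("closed form used here requires an even nu1")
    a = nu2 / 2.0
    b = nu1 // 2
    x = nu2 / (nu2 + nu1 * f)
    term, total = 1.0, 0.0
    for k in range(b):
        total += term
        term *= (a + k) / (k + 1) * (1.0 - x)
    return x ** a * total

# Observed statistic from the example, with nu1 = I - 1 = 4 and nu2 = I(J - 1) = 25.
p_value = f_upper_tail_even_nu1(9.0077, 4, 25)
print(p_value)
```

A p-value this far below $\alpha = 0.05$ agrees with the conclusion reached by comparing $f_{obs}$ with the critical value.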

26 Solution by MATLAB
[p, table, stats] = anova1(data)
p = 1.1971e-004
table =
'Source'    'SS'           'df'    'MS'      'F'        'Prob>F'
'Columns'   [ ]            [ 4]    [ ]       [9.0076]   [1.1971e-004]
'Error'     [ ]            [25]    [.867]    []         []
'Total'     [.360e+003]    [29]    []        []         []
stats =
gnames: [5x1 char]
n: [ ]
source: 'anova1'
means: [ ]
df: 25
s: [ ]
[Entries shown as [ ], and the leading digits of .867 and .360e+003, were not recovered.]

28 What Happens After We Reject $H_0$
Recall that $H_0: \mu_1 = \mu_2 = \dots = \mu_I$.
If $f_{obs} < F_{\alpha}$, or $p > \alpha$, then we cannot reject $H_0$: the data give no evidence against $\mu_1 = \dots = \mu_I$.
But what happens next, if $f_{obs} > F_{\alpha}$ and we reject $H_0$? We accept the alternative hypothesis, which means that not all means can be considered equal, so at least two of the means must differ. But which ones?
Multiple Comparisons Procedure
The idea is to check all pairwise differences of means, $\mu_i - \mu_j$ (for all $i < j$), and compute the CI for each.
- Those intervals that do not include zero indicate that $\mu_i$ and $\mu_j$ differ significantly.
- Those intervals that do include zero indicate that $\mu_i$ and $\mu_j$ do not differ significantly.

29 Tukey's Procedure (T-Method for Multiple Comparisons)
Use yet another distribution: the Studentized Range Distribution (tables).
$Q_{\alpha,m,\nu}$: the critical value that leaves an upper-tail area of $\alpha$ under the Studentized range distribution with numerator df $m$ and denominator df $\nu$.
With probability $1 - \alpha$,
$$\bar{X}_{i\cdot} - \bar{X}_{j\cdot} - Q_{\alpha,\,I,\,I(J-1)}\sqrt{MSE/J} \;\le\; \mu_i - \mu_j \;\le\; \bar{X}_{i\cdot} - \bar{X}_{j\cdot} + Q_{\alpha,\,I,\,I(J-1)}\sqrt{MSE/J}$$
for every $i$ and $j$ with $i < j$. Note that $m = I$ (not $I - 1$ as it was in the F-distribution) and $\nu = I(J - 1)$.
This formula computes the confidence interval for all $\mu_i - \mu_j$, but do we really need the entire confidence interval? We only need to know whether the CI includes zero or not. There is a simpler form of Tukey's test!

30 Simplified Tukey's Test
1. Select $\alpha$ and extract the corresponding $Q_{\alpha,\,I,\,I(J-1)}$.
2. Calculate $w = Q_{\alpha,\,I,\,I(J-1)}\sqrt{MSE/J}$.
3. List the sample means in increasing order and underline those pairs that differ by less than $w$. Any pair not underlined by the same line corresponds to a pair of means that are significantly different.
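The decision rule behind the simplified test is that two means are judged significantly different when their sample means differ by more than $w$. The Python sketch below applies that rule to every pair; the group labels, the means, and the value of $w$ are hypothetical, and $w$ is taken as given because the Studentized range critical value has to come from a table.

```python
from itertools import combinations

def tukey_significant_pairs(means, w):
    """Return the pairs of group labels whose sample means differ by more than w.

    `means` maps a group label to its sample mean; pairs are reported with the
    smaller mean first, following the sorted order used in the manual procedure.
    """
    ordered = sorted(means.items(), key=lambda item: item[1])
    return [
        (label_a, label_b)
        for (label_a, mean_a), (label_b, mean_b) in combinations(ordered, 2)
        if mean_b - mean_a > w
    ]

# Hypothetical sample means and threshold (not the lecture's values).
group_means = {"A": 10.0, "B": 12.0, "C": 19.5, "D": 21.0}
pairs = tukey_significant_pairs(group_means, w=8.0)
print(pairs)
```

Pairs that are not returned, such as A and B here, are the ones that would share an underline in the manual procedure.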

31 Example
Recall our earlier example, for which we had the following results:
p = 1.1971e-004
'Source'    'SS'           'df'    'MS'      'F'        'Prob>F'
'Columns'   [ ]            [ 4]    [ ]       [9.0076]   [1.1971e-004]
'Error'     [ ]            [25]    [.867]    []         []
'Total'     [.360e+003]    [29]    []        []         []
Means: [values not recovered]
Let's compute $w$:
$$w = Q_{\alpha,\,I,\,I(J-1)}\sqrt{\frac{MSE}{J}} = Q_{0.05,\,5,\,25}\sqrt{\frac{MSE}{J}} = 8.03$$
Sort means: Grp4 Grp3 Grp Grp5 Grp [two group numbers and the underlining not recovered]
How to interpret?

32 In MATLAB
c = multcompare(stats, alpha) performs a multiple comparison test using the information in the stats structure (from anova1(.)), and returns a matrix c of pairwise comparison results. It also displays an interactive figure presenting a graphical representation of the test.
The output c contains the results of the test in the form of a five-column matrix. Each row of the matrix represents one test, and there is one row for each pair of groups. The entries in the row indicate the means being compared, the estimated difference in means, and a confidence interval for the difference.
For example, suppose one row contains the following entries:
2.0000 5.0000 1.9442 8.2206 14.4971
These numbers indicate that the mean of group 2 minus the mean of group 5 is estimated to be 8.2206, and a 95% confidence interval for the true difference in means is [1.9442, 14.4971]. In this example the confidence interval does not contain 0.0, so the difference is significant at the 0.05 level. If the confidence interval did contain 0.0, the difference would not be significant at the 0.05 level.
The multcompare function also displays a graph with each group mean represented by a symbol and an interval around the symbol. Two means are significantly different if their intervals are disjoint, and are not significantly different if their intervals overlap. You can use the mouse to select any group, and the graph will highlight any other groups that are significantly different from it.

34 Homework
From Chapter 0, 4 From Chapter 3,,4,8
Analyze the data given in these questions to obtain an ANOVA table: solve by hand, then by MATLAB, and compare your results. If you do not get the same results, you did not solve correctly!