Data Analysis. Lecture Empirical Model Building and Methods (Empirische Modellbildung und Methoden) SS Analysis of Experiments - Introduction


1 Data Analysis Lecture Empirical Model Building and Methods (Empirische Modellbildung und Methoden) Prof. Dr. Dr. h.c. Dieter Rombach Dr. Andreas Jedlitschka SS 2014 Analysis of Experiments - Introduction Some parts of this lecture are adopted with permission from lectures given by Sira Vegas and Oscar Dieste at UPM

2 Outline. Descriptive statistics. Statistical analysis. Parametric tests: Student's t-test, paired t-test, one-way ANOVA. Non-parametric tests: Mann-Whitney, Wilcoxon, sign test. Jedlitschka, Vegas, Dieste 2014 Slide 2

3 DESCRIPTIVE STATISTICS

4 Important notice. In inferential statistics, population parameters are clearly distinguished from estimators (values calculated from samples). Population parameters are designated by Greek letters: μ, σ², σ. Estimators are designated by Latin letters: m, s², s. In most cases, a symbol carries a subscript denoting the associated sample (usually a treatment): μ_a, s_b.

5 Important notice. The notation matters because some estimators are calculated differently from the corresponding population parameters, notably the variance. Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)^2$. This also affects the standard deviation, which is the square root of the variance. (n-1) is the number of degrees of freedom of the sample. This will be important soon.

6 Measures of central tendency. Dataset: {1, 2, 2, 2, 3, 14}. Arithmetic mean: 4. Median (middle value of the ordered values): 2. Mode (the value that appears most often): 2. The measures differ in their response to outliers.
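The three measures can be checked with Python's standard `statistics` module. This is only a small sketch on the slide's dataset; the variable names (`data`, `central`, `without_outlier`) are illustrative and not part of the lecture.

```python
from statistics import mean, median, mode

data = [1, 2, 2, 2, 3, 14]  # dataset from the slide

central = {
    "mean": mean(data),      # pulled upwards by the outlier 14
    "median": median(data),  # average of the two middle values (2 and 2)
    "mode": mode(data),      # most frequent value
}
print(central)

# Dropping the outlier moves the mean but leaves median and mode unchanged.
without_outlier = data[:-1]
print(mean(without_outlier), median(without_outlier), mode(without_outlier))
```

Only the mean reacts to the value 14, which is the point of the slide's remark about outliers.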

7 Mean, Median, Mode

8 Dispersion (1/2). Dataset: {1, 2, 2, 2, 3, 14}. Range {min, max}: {1, 14}. Standard deviation (SD): σ if the data is the whole population (computed with N and μ), s if the data is a sample (computed with N-1 and m). It informs about the variation around the average and is the square root of the variance. For this dataset: σ = 4,51.
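The difference between dividing by N and by N-1 can be made concrete with a short sketch using the standard `statistics` module (`pstdev` for the population formula, `stdev` for the sample formula). The variable names are illustrative.

```python
from statistics import pstdev, stdev

data = [1, 2, 2, 2, 3, 14]  # dataset from the slide

value_range = (min(data), max(data))
sigma = pstdev(data)  # population SD: squared deviations divided by N
s = stdev(data)       # sample SD: squared deviations divided by N - 1

print(value_range, round(sigma, 2), round(s, 2))
```

The population formula gives the 4,51 quoted on the slide; the sample formula gives a slightly larger value because of the smaller divisor.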

9 Dispersion (2/2). Interquartile Range

10 Shape. Variance σ²: the average of the squared differences from the mean. Skewness. Kurtosis.

11 Dependency. Linear regression. Correlation coefficient (Pearson): requires interval or ratio scale and normal distribution. More than two variables: multivariate analysis, e.g. principal component analysis.

12 Motivation STATISTICAL ANALYSIS

13 A simple experiment. Experiments don't have to be complicated. They can be as simple as comparing a technology to something else: 1 factor.

14 Distribution and Probability. Find out whether this is a fair die! What could be the idea?

15 Solution Approach. Either you have a trustworthy expectation, or you derive one: take one of the dice at random, throw it one hundred times, note down each single event, and derive the distribution. Now take the die in question and check whether it fulfils the expectation.
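The approach can be sketched as a simulation. The slide does not say how to "check whether it fulfils the expectation"; the sketch below assumes a chi-square goodness-of-fit comparison against a uniform expectation, which is one common choice and not necessarily the one intended in the lecture. The function name, the seed, and the simulated die are all illustrative.

```python
import random
from collections import Counter


def chi_square_statistic(counts):
    """Pearson chi-square statistic against a uniform (fair die) expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


# Tabulated critical value of the chi-square distribution (df = 5, alpha = 0.05).
CRITICAL_CHI2_DF5 = 11.070

rng = random.Random(2014)  # fixed seed so the sketch is repeatable
throws = [rng.randint(1, 6) for _ in range(100)]  # simulated fair die
tally = Counter(throws)
counts = [tally.get(face, 0) for face in range(1, 7)]

statistic = chi_square_statistic(counts)
verdict = "deviates from a fair die" if statistic > CRITICAL_CHI2_DF5 else "no evidence against fairness"
print(counts, round(statistic, 2), verdict)
```

Even a fair die rarely produces exactly equal counts in one hundred throws, which is why the observed frequencies have to be judged against a distribution rather than by eye.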

16 A simple experiment. Experiments don't have to be complicated. They can be as simple as comparing a pair of techniques: 1 factor with 2 levels. In cases like these, we don't need expensive tools (SPSS, STATA, etc.) to analyze the experimental results. A scholar wants to know if technique A (say functional testing) is better than B (say inspection). He performs an experiment with some students and gets the following data (metric: a higher value means "better"). Technique, in order of measurement: A A B B A B B B A A B. Measure: values listed per technique on slide 17.

17 Question. How can we decide which technique (A, B) is better? (SPSS) The most obvious option is looking at the data with descriptive statistics (median, means, quartiles, variances, standard deviation) and suitable plots (box plots).
          A       B
          29,9    26,6
          11,4    23,7
          25,3    28,5
          16,5    14,2
          21,1    17,9
                  24,3
N         5       6
mean      20,84   22,53
variance  52,50   29,51
std. dev  7,25    5,43
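The summary rows of the table can be reproduced without a statistics package. The sketch below uses Python's standard `statistics` module; the grouping of the measurements into A and B follows the two columns of the table, and the names `a`, `b` and `summary` are illustrative.

```python
from statistics import mean, stdev, variance

# Measurements per technique, as listed on the slide.
a = [29.9, 11.4, 25.3, 16.5, 21.1]
b = [26.6, 23.7, 28.5, 14.2, 17.9, 24.3]

summary = {
    name: {
        "N": len(values),
        "mean": round(mean(values), 2),
        "variance": round(variance(values), 2),  # sample variance (N - 1)
        "std_dev": round(stdev(values), 2),      # sample standard deviation
    }
    for name, values in (("A", a), ("B", b))
}

for name, row in summary.items():
    print(name, row)
```

The variances here use the N-1 divisor, which matches the values shown in the table.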

18 Box plot. [Figure: box plots of techniques A and B, each annotated with min, Q1, Q3, max]

19 Preliminary answer. B looks better, but the results are quite similar. We cannot be sure! It is likely that the differences arise due to random chance. Don't believe it? Remember what we found out with the dice. Or think about throwing a coin four times (What do you expect? What do you get?). As we can see from this example, many processes have an associated probability distribution. How can we make a decision in this case?
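The coin question has an exact answer: the number of heads in four tosses of a fair coin follows a binomial distribution. A minimal sketch, assuming a fair coin and independent tosses (the names `tosses` and `distribution` are illustrative):

```python
from math import comb

tosses = 4

# P(k heads) = C(4, k) / 2^4 for a fair coin.
distribution = {heads: comb(tosses, heads) / 2 ** tosses for heads in range(tosses + 1)}

for heads, probability in distribution.items():
    print(f"{heads} heads: {probability:.4f}")
```

The "expected" outcome of two heads occurs in only 37.5% of the cases, so getting something else is more likely than not.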

20 Key question. Idea: if we knew the probability distribution, we could calculate the probability that B > A. Formally speaking: μ_b > μ_a. Problem: what happens if we do not know the probability distribution?

21 Reference distribution. Fisher claims that it is possible to relate the experimental results to a reference distribution that is based on the same experimental data. Using this reference distribution, we can estimate how likely a given result is under the assumption that A and B do not differ (that is, supposing that μ_b = μ_a). Does the difference between the two groups represent a real difference, or was it due to chance?
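One way to build such a reference distribution from the data itself is a randomization (permutation) test: if A and B do not differ, the labels are arbitrary, so every way of assigning five of the eleven measurements to A is equally likely. The sketch below enumerates all assignments for the A/B data of slide 17. It is an illustration of the idea under that assumption, not necessarily the exact construction used in the lecture, and the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

a = [29.9, 11.4, 25.3, 16.5, 21.1]
b = [26.6, 23.7, 28.5, 14.2, 17.9, 24.3]

pooled = a + b
observed = mean(b) - mean(a)  # observed difference between the group means

# Reference distribution: the difference of means for every possible relabelling.
reference = []
for indices_a in combinations(range(len(pooled)), len(a)):
    group_a = [pooled[i] for i in indices_a]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in indices_a]
    reference.append(mean(group_b) - mean(group_a))

# Share of relabellings that give a difference at least as large as the observed one.
# The small tolerance guards against floating-point noise in the comparison.
p_value = sum(d >= observed - 1e-12 for d in reference) / len(reference)

print(len(reference), round(observed, 2), round(p_value, 3))
```

Enumerating every relabelling is feasible here only because the sample is tiny, which is why tabulated standard distributions are used in practice.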

22 Standard distributions. Building the reference distribution, even for a small example, requires a lot of effort. Under some assumptions, reference distributions are close to known probability distributions, such as the normal (Gauss) distribution or, in our particular case, Student's t. The t distribution is used instead of the normal distribution when the sample sizes involved are small. The good thing is that standard distributions are tabulated. Significance levels can be obtained immediately from the tables.

23 Use the standard distribution. Calculate the actual difference between the means, say d = (x̄_b - x̄_a). Locate d in the histogram. Calculate the area of the histogram that falls to the right of d. That area is the probability that, by chance alone, we could obtain a difference between means of (x̄_b - x̄_a) or higher. We call it the p-value. If the p-value is below a cutoff value α (the significance level), we can affirm that the techniques A and B are not alike. α is set by convention (arbitrarily) at 0.05. We then say that we have obtained a significant result.

24 Back to the Example. Observed difference. Null hypothesis is not rejected.

25 Parametric Test / Independent Sample T-TEST

26 T-Test. One-factor experiments with one level: one-sample t-test. Compares the mean response of a group against a specific value. The formula shows the general concept used by the following tests: $t_0 = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\bar{x}$ = mean (of groups 1 and 2), $\mu_0$ = specified value (e.g., population mean), n = number of subjects in groups (1 and 2) (equal!), s = standard deviation of group (1 and 2), df = n-1. Look up t in Student's t-distribution table to obtain the p-value.

27 T-Test. One-factor experiments with two levels: two-sample t-test. Checks the statistical significance of the difference between the mean responses of two levels of a factor. Checks the null hypothesis that the samples belong to two subpopulations with the same mean, H0: $\mu_1 = \mu_2$. Pre-requisites: the two sample sizes (that is, the number n of participants in each group) are equal, and it can be assumed that the two distributions have the same variance ($\sigma_1^2 = \sigma_2^2$). Formula: $t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{2/n}}$ with $s_p = \sqrt{(s_1^2 + s_2^2)/2}$, where $\bar{x}_1, \bar{x}_2$ = means of groups 1 and 2, n = number of subjects in each group (equal!), $s_1, s_2$ = standard deviations of groups 1 and 2, $s_1^2, s_2^2$ = unbiased estimators of the variances, df = 2n-2.

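28 T-Test. One-factor experiments with two levels: special cases. Unequal sample sizes, equal variance: $t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}$ with $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$, df = n_1 + n_2 - 2. Equal or unequal sample sizes, unequal variances (Welch's t-test): $t_0 = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$.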

29 T-Test. Example: program defect density in two projects (data taken from Wohlin et al.).
Project A   Project B
3,42        3,44
2,71        4,97
2,84        4,76
1,85        4,96
3,22        4,10
3,48        3,05
2,68        4,09
4,30        3,69
2,49        4,21
1,54        4,40
            3,49
1. Calculate the means.
2. Calculate the difference of the means.
3. Use the formula (unequal N).
4. Check the obtained t value against the t distribution table for the respective df.
5. Reject H0 if |t0| > t_{α/2,df} (two-sided) or if t0 > t_{α,df} (one-sided).
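The five steps can be followed in a few lines of Python. The sketch assumes the equal-variance formula for unequal sample sizes (pooled standard deviation, df = n1 + n2 - 2) and takes the two-sided critical value for df = 19 and α = 0.05 from a standard t table. The function name `pooled_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance

project_a = [3.42, 2.71, 2.84, 1.85, 3.22, 3.48, 2.68, 4.30, 2.49, 1.54]
project_b = [3.44, 4.97, 4.76, 4.96, 4.10, 3.05, 4.09, 3.69, 4.21, 4.40, 3.49]


def pooled_t(x, y):
    """Two-sample t statistic assuming equal variances; returns (t0, df)."""
    n1, n2 = len(x), len(y)
    pooled_sd = sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))
    t0 = (mean(x) - mean(y)) / (pooled_sd * sqrt(1 / n1 + 1 / n2))
    return t0, n1 + n2 - 2


# Tabulated two-sided critical value t_{0.025, 19}.
T_CRITICAL_DF19 = 2.093

t0, df = pooled_t(project_a, project_b)
reject_h0 = abs(t0) > T_CRITICAL_DF19
print(round(t0, 2), df, "reject H0" if reject_h0 else "do not reject H0")
```

The sign of t0 only reflects the order in which the two groups are subtracted; the two-sided decision uses its absolute value.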

30 t-distribution requirements. There are three requirements.
1. Samples must be independent and identically distributed (i.i.d.). In practice, this means that the assignment of levels (A's and B's) to experimental units (subjects) has to be performed in a randomized way. i.i.d. implies homoscedasticity and non-interaction.
2. Accordingly, the mean estimator should be normally distributed (or close to normality).
3. Response variables are measured on ratio scales. Ordinal metrics cannot be used.
Condition #1 is probably more important than conditions #2 and #3.

31 Non-parametric tests. If condition #2 does not hold (there are several tests to check normality) or condition #3 does not hold (ordinal metrics can then be used), a non-parametric test can be applied. Condition #1 must still hold. The Wilcoxon rank-sum or Mann-Whitney test is one of the most popular tests. It is quite easy, but requires a minimum sample size and has some technical problems (power calculation).

32 Parametric vs. non-parametric. Obviously, the t-test is an instance of a parametric test. The main difference between both types of tests is the assumption about the distribution of the sample. Non-parametric tests do not assume a particular distribution (such as normality). Non-parametric tests can be applied in situations where parametric tests cannot, but in turn they are more conservative (less power).

33 Non-Parametric Test / Independent Sample MANN WHITNEY U TEST

34 Mann Whitney U test. Non-parametric test for independent groups. It has greater efficiency than the t-test on non-normal distributions. Pre-requisites: the responses are at least ordinal; the distributions of both groups are equal under the null hypothesis.

35 Mann Whitney U test. Method 1: for small samples a direct method is recommended. It is very quick and gives an insight into the meaning of the U statistic. Choose the sample for which the ranks seem to be smaller (the only reason to do this is to make the computation easier). Call this "sample 1," and call the other sample "sample 2." For each observation in sample 1, count the number of observations in sample 2 that have a smaller rank (count a half for any that are equal to it). The sum of these counts is U.
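The direct method translates almost literally into two nested loops. The sketch below applies it to the program defect density data of the t-test example (Wohlin et al.); comparing raw values is equivalent to comparing ranks. The function name `u_direct` is illustrative.

```python
project_a = [3.42, 2.71, 2.84, 1.85, 3.22, 3.48, 2.68, 4.30, 2.49, 1.54]
project_b = [3.44, 4.97, 4.76, 4.96, 4.10, 3.05, 4.09, 3.69, 4.21, 4.40, 3.49]


def u_direct(sample1, sample2):
    """Count, for each value in sample1, the sample2 values that are smaller (ties count 0.5)."""
    u = 0.0
    for x in sample1:
        for y in sample2:
            if y < x:
                u += 1.0
            elif y == x:
                u += 0.5
    return u


u_a = u_direct(project_a, project_b)  # sample 1 = project A (the smaller ranks)
u_b = u_direct(project_b, project_a)  # roles swapped
print(u_a, u_b)
```

The two counts always add up to the number of pairs, here 10 x 11 = 110, so computing one of them is enough.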

36 Mann Whitney U test. Method 2: for larger samples, a formula can be used. Add up the ranks of the observations that came from sample 1. Where there are tied groups, take the rank to be equal to the midpoint of the group. The sum of ranks in sample 2 is then determined, because the sum of all the ranks equals N(N + 1)/2, where N is the total number of observations. U is then given by $U_i = n_1 n_2 + \frac{n_i(n_i + 1)}{2} - R_i$ for i = 1, 2, where $R_i$ is the sum of ranks of the respective group. Reject H0 if min(U1, U2) is <= the critical value from the Mann-Whitney table.

37 Mann Whitney U test. Example: program defect density (data taken from Wohlin et al.).
Project A   Rank    Project B   Rank
3,42        9       3,44        10
2,71        5       4,97        21
2,84        6       4,76        19
1,85        2       4,96        20
3,22        8       4,10        15
3,48        11      3,05        7
2,68        4       4,09        14
4,30        17      3,69        13
2,49        3       4,21        16
1,54        1       4,40        18
                    3,49        12
Sum of ranks: 66 (A), 165 (B)
U1 = 99 (use formula), U2 = 11 (use formula). Check min(U1, U2) in the table for n of smaller sample = 10 and n of larger sample = 11: 11 <= 26, so reject H0.
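The rank-sum calculation can be sketched as follows. The U formula used here, U_i = n1·n2 + n_i(n_i + 1)/2 - R_i, is the variant that reproduces the values U1 = 99 and U2 = 11 reported on the slide; the helper names `midranks` and `mann_whitney_u` are illustrative, and the critical value 26 is the one quoted on the slide.

```python
project_a = [3.42, 2.71, 2.84, 1.85, 3.22, 3.48, 2.68, 4.30, 2.49, 1.54]
project_b = [3.44, 4.97, 4.76, 4.96, 4.10, 3.05, 4.09, 3.69, 4.21, 4.40, 3.49]


def midranks(values):
    """Map each distinct value to its rank; tied values share the midpoint rank."""
    ordered = sorted(values)
    ranks = {}
    for value in set(ordered):
        first_position = ordered.index(value) + 1
        ties = ordered.count(value)
        ranks[value] = first_position + (ties - 1) / 2
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (R1, R2, U1, U2) from the joint ranking of both samples."""
    ranks = midranks(sample1 + sample2)
    n1, n2 = len(sample1), len(sample2)
    r1 = sum(ranks[v] for v in sample1)
    r2 = sum(ranks[v] for v in sample2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, u1, u2


CRITICAL_U = 26  # table value for sample sizes 10 and 11, as quoted on the slide

r1, r2, u1, u2 = mann_whitney_u(project_a, project_b)
print(r1, r2, u1, u2, "reject H0" if min(u1, u2) <= CRITICAL_U else "do not reject H0")
```

Unlike the t-test, the non-parametric test rejects H0 when the statistic is small, not when it is large.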

38 Parametric Test / Dependent Sample PAIRED T-TEST

39 Paired T-Test. Parametric test for dependent samples, e.g., repeated measures or matched pairs. The differences between all pairs must be calculated. Formula: $t_0 = \frac{\bar{d} - \mu_0}{s_D / \sqrt{n}}$, where $\bar{d}$ = mean of the differences between pairs, $\mu_0$ = (optional) specified value (e.g., population mean), n = number of subjects, $s_D$ = standard deviation of the differences (1 and 2), df = n-1.

40 Example Paired T-Test.
1. Calculate the differences (P1 - P2).
2. Calculate the mean of the differences.
3. Calculate the std. dev. of the differences.
4. Use the formula.
5. Check the t value against the table for the respective df.
6. Reject H0 if |t0| > t_{α/2,df} (two-sided) or if t0 > t_{α,df} (one-sided).
Table (columns: Programmer, P1, P2, P1 - P2); of the individual rows only fragments survive (P2 = …,1 with difference 18,9; P2 = …,9 with difference 16,1; P2 = …,3 with difference 32,7).
            P1       P2       P1 - P2
mean        131,10   127,73   3,37
variance    627,…    …,60     748,46
std. dev.   25,04    39,54    27,36
df (N - 1) = 9
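Because the paired t statistic depends only on the mean and standard deviation of the differences, steps 4 to 6 can be sketched from the summary rows of the table alone. The sketch assumes μ0 = 0 (no difference under H0) and takes the two-sided critical value for df = 9 and α = 0.05 from a standard t table; the variable names are illustrative.

```python
from math import sqrt

# Summary of the differences P1 - P2, as given in the table.
mean_difference = 3.37
sd_difference = 27.36
n = 10  # df = N - 1 = 9

# Paired t statistic with mu_0 = 0.
t0 = mean_difference / (sd_difference / sqrt(n))

# Tabulated two-sided critical value t_{0.025, 9}.
T_CRITICAL_DF9 = 2.262

reject_h0 = abs(t0) > T_CRITICAL_DF9
print(round(t0, 2), "reject H0" if reject_h0 else "do not reject H0")
```

The large spread of the differences relative to their mean is what keeps the statistic far below the critical value.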

41 T-Test Table. Reject H0 if |t0| > t_{α/2,df} (two-sided). Here t0 = 3,37 / (27,36 / √10) ≈ 0,39, which is smaller than t_{0,025;9} = 2,262 => do not reject H0!

42 Table for T-Test. SPSS Outputs

43 Non Parametric Test / Dependent Sample WILCOXON SIGN TEST

44 Wilcoxon. Non-parametric alternative to the paired t-test for dependent samples. Pre-requisites: it must be possible to determine which value of a pair is larger and to rank the differences. Procedure: compute d = P1 - P2 and rank the absolute differences. T1 = 23 (sum of the ranks of negative d), T2 = 32 (sum of the ranks of positive d). Of the individual rows only fragments survive (differences 18,9; 16,1; 32,7).
            P1       P2       P1 - P2
N           10,00    10,00
mean        131,10   127,73   3,37
variance    627,…    …,60     748,46
std. dev    25,04    39,54    27,36
Check min(T1, T2) in the table: 23 is not <= 8, so do not reject H0.
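The rank sums T1 and T2 can be computed as sketched below. Since the individual differences of the example are not fully available here, the sketch runs on a small hypothetical list of differences; only the procedure (drop zero differences, rank the absolute values with midranks for ties, sum the ranks by sign) is meant to carry over. The function name `signed_rank_sums` is illustrative.

```python
def signed_rank_sums(differences):
    """Return (sum of ranks of negative d, sum of ranks of positive d)."""
    nonzero = [d for d in differences if d != 0]  # zero differences carry no sign
    ordered = sorted(abs(d) for d in nonzero)

    def midrank(value):
        # Tied absolute differences share the midpoint of their positions.
        first_position = ordered.index(value) + 1
        return first_position + (ordered.count(value) - 1) / 2

    t_negative = sum(midrank(abs(d)) for d in nonzero if d < 0)
    t_positive = sum(midrank(abs(d)) for d in nonzero if d > 0)
    return t_negative, t_positive


# Hypothetical differences, for illustration only (not the lecture data).
example_differences = [3, -1, 4, -2, 5]
t_negative, t_positive = signed_rank_sums(example_differences)
print(t_negative, t_positive, min(t_negative, t_positive))
```

The smaller of the two sums is the value that is compared with the tabulated critical value.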

45 Sign Test. Non-parametric alternative to the paired t-test for dependent samples. Used if it is not possible to rank the differences, but the scale is still at least ordinal. Based on the signs of the differences. Formula: $p = \frac{1}{2^N}\sum_{i=0}^{n}\binom{N}{i}$, where N is the number of pairs and n = min(T1, T2). Of the individual rows only fragments survive (differences 18,9; 16,1; 32,7).
            P1       P2       P1 - P2
N           10,00    10,00
mean        131,10   127,73   3,37
variance    627,…    …,60     748,46
std. dev    25,04    39,54    27,36
Count of + signs: 6. T1 = 6 (# positive d), T2 = 4 (# negative d), n = min(T1, T2) = 4 => do not reject H0!
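The sign test reduces to a binomial tail probability: under H0 each difference is positive or negative with probability 1/2. The sketch below assumes the one-sided form p = (1/2^N) · Σ C(N, i) for i = 0..n with n = min(T1, T2), which is the form given in Wohlin et al.; the function name `sign_test_p` is illustrative.

```python
from math import comb


def sign_test_p(count_a, count_b):
    """One-sided sign test p-value for the two sign counts (zero differences excluded)."""
    total = count_a + count_b
    smaller = min(count_a, count_b)
    return sum(comb(total, i) for i in range(smaller + 1)) / 2 ** total


p_value = sign_test_p(6, 4)  # sign counts from the slide
alpha = 0.05
print(round(p_value, 3), "reject H0" if p_value < alpha else "do not reject H0")
```

A 6-to-4 split of signs is entirely unremarkable for ten pairs, which is why H0 is not rejected.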

46 Parametric Methods / Independent Sample ONE FACTOR ANOVA

47 ONE-FACTOR ANOVA. One-factor experiments with more than two levels. Checks the statistical significance of the difference between the mean responses of one factor with several levels. Model: $Y_{ij} = \mu + \alpha_j + e_{ij}$, with the effect of level j estimated as $\alpha_j = \bar{Y}_j - \bar{Y}$. Steps:
1. Identify the mathematical model.
2. Validate the basic model that relates the experimental variables.
3. Calculate the factor-induced variation in the response variable.
4. Calculate the statistical significance of the factor-induced variation.
5. Establish consequences or recommendations on the alternative that provides the best response variable values.
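Steps 3 and 4 come down to comparing the variation between the level means with the variation inside the levels. The sketch below computes the resulting F statistic for a one-factor design; the data are hypothetical and chosen only so that the arithmetic is easy to follow, and the function name `one_way_anova_f` is illustrative. The F value would then be looked up in an F table for the two degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-factor design."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    # Factor-induced variation: group means around the grand mean.
    ss_between = sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)
    # Residual variation: observations around their own group mean.
    ss_within = sum((value - mean(group)) ** 2 for group in groups for value in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical responses for three levels of a factor (not the lecture data).
levels = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_statistic, df_between, df_within = one_way_anova_f(levels)
print(f_statistic, df_between, df_within)
```

A large F means that the level means differ by more than the within-level scatter would explain.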

48 Example: ANOVA. Factor = programming language; levels = {ADA, C, C++, JAVA}. Response variable = number of errors detected during three months after development ("Quality"). Number of subjects = 24. H0: there is no effect of the programming language on the quality of the program. [Table: N and mean per programming language (ADA, C, C++, JAVA); grand mean = 64]

49 Example: ANOVA. Results: Descriptives. ADA leads to a quality of 61 ± 1.83.

50 Example: ANOVA. Results: homogeneity of variances: do not reject H0. There are no significant differences between the variances of the groups => the variances are treated as equal. There is a statistically significant difference between groups as determined by one-way ANOVA (F = …, p = .021). What do we know now?

51 Example: ANOVA. Post hoc tests: Scheffé is used because of the different N; otherwise Tukey is preferred. There is a statistically significant difference between ADA and C (p=0.032), ADA and C++ (p=0.002), JAVA and C (p=0.009), and JAVA and C++ (reported as p=0.000, i.e. p<0.001). There is no significant difference between ADA and JAVA, nor between C and C++.

52 Example: ANOVA. Homogeneous Subsets

53 Example: ANOVA. Means Plot

54 Further Analysis. Two-way ANOVA. MANOVA. rANOVA (repeated-measures ANOVA). Multitude of other tests.

55 DECISION TREE

56 References
Wohlin, Runeson, Höst, Ohlsson, Regnell, Wesslén (2012). Experimentation in Software Engineering. Springer.
Bortz, J., and Döring, N. (2006). Forschungsmethoden und Evaluation für Human- und Sozialwissenschaftler (4. Auflage). Berlin: Springer Verlag.
Juristo, N., and Moreno, A. (2001). Basics of Software Engineering Experimentation. Kluwer Academic Publishers.