Statistics: Module 2

1 Statistics: Module 2 Geert Verbeke I-BioStat: Interuniversity Institute for Biostatistics and statistical Bioinformatics K.U.Leuven & Hasselt University, Belgium geert.verbeke@med.kuleuven.be verbeke PhD Biomedical Sciences

2 Contents
1 The comparison of two means: Unpaired data
2 The comparison of two proportions: Unpaired data
3 The comparison of two means: Paired data
4 The comparison of two proportions: Paired data
5 Errors in statistics: Basic concepts
6 Errors in statistics: Practical implications
7 One-sided versus two-sided tests
8 Describing associations
9 Non-parametric statistics
Bibliography
PhD Biomedical Sciences: Module 2

4 Chapter 1 The comparison of two means: Unpaired data
1.1 Example
1.2 Confidence interval for the difference of two means
1.3 The unpaired t-test
1.4 Assumptions
1.5 Example: Survival times of cancer patients
1.6 Example from the biomedical literature

5 1.1 Example Consider an experiment in which weight gain in rats on a high protein diet is compared with weight gain in rats on a low protein diet. [Figure: group-specific histograms]

6 The group-specific summary statistics show that, on average, there is an observed difference of 19g between the rats on a high protein diet and those on a low protein diet. Is this observed difference sufficient evidence to conclude that there is indeed an effect of diet on weight gain? It would be of interest to know how likely such a difference of 19g is to occur if weight gain were completely unrelated to the protein level of the diet.

7 Note that, strictly speaking, we have two populations, with a sample randomly drawn from each. High protein rats: the hypothetical population of all rats that are given a high protein diet. Low protein rats: the hypothetical population of all rats that are given a low protein diet. From the first population, a random sample of n1 = 12 rats was taken. From the second one, a random sample of n2 = 7 rats was drawn. The corresponding observed means are x̄1 = 120 and x̄2 = 101, respectively. Because there is no relation between the observations taken from the first population and those taken from the second, we have unpaired data.

8 1.2 Confidence interval for the difference of two means Let µ1 and µ2 be the (unknown) mean weight gain in the high and low protein population, respectively. [Figure: distributions of weight gain in the low protein (mean µ2) and high protein (mean µ1) populations] Of interest is to draw inferences about µ1 − µ2.

9 As always, our estimate of µ1 − µ2 is µ̂1 − µ̂2 = x̄1 − x̄2 = 19. Based on the observed data, C.I.s can be constructed for µ1 − µ2. For example, a 95% C.I. for µ1 − µ2 is given by [−2.19; 40.19]. The true difference µ1 − µ2 may or may not be in the interval [−2.19; 40.19]. However, if 100 similar experiments were conducted, then 95 out of the 100 corresponding C.I.s would be expected to contain µ1 − µ2. Hence, with 95% certainty, we can conclude that we believe µ1 − µ2 to be within the interval [−2.19; 40.19].

10 This C.I. shows that: the estimate (19g) of µ1 − µ2 is a very imprecise estimate, since the C.I. is very wide; the estimate is up to 21.19 units precise with 95% chance (the distance from 19 to either end of the interval); based on our data, it cannot be ruled out that µ1 − µ2 would be zero, i.e., that there would be no difference between both populations.

11 1.3 The unpaired t-test Often, it is of interest to test whether two populations have the same mean. This is translated into a set of hypotheses of the form: H0: µ1 = µ2 versus HA: µ1 ≠ µ2. We will reject the null hypothesis if the observed data deviate too much from what one would expect to see if the null hypothesis were correct. Hence, we will reject H0 if x̄1 is much larger than x̄2, or vice versa. This is equivalent to rejecting H0 if |x̄1 − x̄2| is too large.

12 Question: How large is too large? Answer: If the observed difference |x̄1 − x̄2| is very unlikely to happen by pure chance. We therefore calculate the probability p of observing a similar experiment with a mean difference between the groups of at least 19g, if µ1 = µ2.

13 In our example, this probability equals p = 0.076. So, even if there is no relation at all between the protein content of the diet and weight gain, one can still expect to observe a difference of at least 19g in 7.6% of future similar experiments. Since p = 0.076 > 0.05 = α, we consider this insufficient evidence to conclude that the protein level would indeed affect the weight gain.

14 Conclusion: There is no significant difference (p = 0.076) in weight gain between rats on a high protein level diet and rats on a low protein level diet. The above testing procedure is called the unpaired t-test since unpaired data are analysed, and since the calculation of the p-value is based on the t-distribution.
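
The link between the reported 95% C.I. and the p-value can be made explicit. The following is only a sketch, not the original analysis: it assumes the C.I. [−2.19; 40.19] was computed from the pooled t-distribution with n1 + n2 − 2 = 17 degrees of freedom, takes the tabulated quantile 2.110 as given, and integrates the t-density numerically. The helper name t_two_sided_p is made up for illustration.

```python
import math

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value of a t statistic, by Simpson integration of the t density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t_stat)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    central_area = 2 * total * h / 3       # P(-|t| < T < |t|)
    return 1 - central_area

n1, n2 = 12, 7
diff = 120 - 101                           # observed difference in means (g)
ci_lower, ci_upper = -2.19, 40.19          # 95% C.I. quoted in the text
df = n1 + n2 - 2                           # assumption: pooled-variance t-test
t_quantile = 2.110                         # tabulated t(0.975; 17), taken as given

std_error = (ci_upper - ci_lower) / 2 / t_quantile   # half-width / quantile
t_stat = diff / std_error
p_value = t_two_sided_p(t_stat, df)
print(f"SE = {std_error:.2f}, t = {t_stat:.2f}, p = {p_value:.3f}")
```

Under these assumptions the p-value lands close to the 0.076 quoted above, which also illustrates why a 95% C.I. that contains zero goes together with p > 0.05.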

15 1.4 Assumptions The calculation of the C.I., as well as the computation of the p-value, is based on the sampling distribution of X̄1 − X̄2, which describes what values for x̄1 − x̄2 can be expected if the experiment were repeated many times. The sampling distribution of X̄1 − X̄2 is completely determined by the sampling distributions of X̄1 and X̄2. In case of large samples, those distributions are known to be normal (CLT). In small samples, this normality of X̄1 and X̄2 is only valid in cases where the original data are (approximately) normally distributed.

16 Therefore, in case of small samples, one assumes the outcome to be normally distributed in each group separately. [Figure: normal distributions for the low protein (mean µ2) and high protein (mean µ1) groups]

17 Conclusion: Large samples: no assumptions. Small samples: normality in both groups. Note that the samples in our example were small (n1 = 12 and n2 = 7). Hence the histograms should be explored for any evidence against symmetry.

18 The group-specific histograms are: [Figure: group-specific histograms] Note that, given the small sample sizes, assessment of symmetry is difficult. This illustrates another drawback of small samples: assumptions are often needed which are very hard to check based on the observed data.

19 Subject-matter knowledge can often help in deciding whether the underlying assumptions are realistic. The unpaired t-test also implicitly assumes that both populations have the same variance. This can be checked with a test for equality of variances, in which the following hypotheses are tested: H0: σ1² = σ2² versus HA: σ1² ≠ σ2². Most software packages automatically report the results from such a test, and even provide a corrected unpaired t-test, which corrects for unequal variances.

20 The variances are not significantly different from each other (p = ), such that our original result remains valid. Note that, since the variances are so similar, the corrected and uncorrected t-tests yield very similar results (p-values). Often, non-equality of the variances is associated with non-normality of the data.

21 1.5 Example: Survival times of cancer patients Based on data on survival times of cancer patients, we want to compare the survival times of stomach cancer patients with the survival times of colon cancer patients. The summary statistics show a large difference of 171 days in average survival time between both groups.

22 On the other hand, there is a lot of variability between the subjects in both groups. Hence, it is not clear whether the observed difference of 171 days is sufficient evidence to conclude that survival times are indeed different for colon cancer patients and stomach cancer patients. Results of the unpaired t-test: We do not find a significant difference between both groups with respect to the survival time (p = ).

23 However, the histograms suggest skewness in the data, such that the underlying assumption of normality becomes questionable. The skewness in the direction of the large values suggests that a logarithmic (or similar) transformation might be useful: X = survival time, Y = ln(X) = ln(survival time).

24 [Figure: histogram shapes and the corresponding possible transformations]

25 [Figure: histograms of X and Y = ln(X) for the stomach and colon cancer groups]

26 As before, assessing symmetry is difficult due to the small number of observations in both groups. However, the evidence against symmetry is much weaker now. Results of the unpaired t-test based on transformed data: The observed difference between both groups is still not significant (p = ), but the p-value is very different from what we obtained before the transformation (p = ). This illustrates that assumptions need to be checked, and that violation of assumptions can lead to serious errors.

27 Note that this is another example where geometric means and standard deviations would be useful to describe the location and spread in survival times in the two cancer groups separately:
Outcome: Survival time (days)
Stomach cancer: geometric mean exp(4.97) ≈ 144, geometric stand.dev. exp(1.25) = 3.49
Colon cancer: geometric mean exp(5.75) ≈ 314, geometric stand.dev. exp(1.00) = 2.72
These geometric means and standard deviations are very different from the arithmetic means and standard deviations that were reported before.
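
The back-transformation behind these geometric summaries can be sketched as follows. The four numbers on the log scale are the ones quoted above; the list called times is hypothetical and only serves to show that exponentiating the mean of the logs is the same thing as the geometric mean.

```python
import math
import statistics

# Mean and standard deviation of Y = ln(survival time), as quoted in the text.
log_summary = {
    "stomach": {"mean_log": 4.97, "sd_log": 1.25},
    "colon": {"mean_log": 5.75, "sd_log": 1.00},
}

for group, s in log_summary.items():
    geo_mean = math.exp(s["mean_log"])   # geometric mean, in days
    geo_sd = math.exp(s["sd_log"])       # geometric standard deviation (a ratio)
    print(f"{group}: geometric mean = {geo_mean:.0f} days, geometric SD = {geo_sd:.2f}")

# Hypothetical survival times (days), made up for illustration only.
times = [20, 50, 100, 400, 1000]
via_logs = math.exp(statistics.mean([math.log(t) for t in times]))
direct = statistics.geometric_mean(times)
print(via_logs, direct)                  # the two routes agree
```

Note that the geometric standard deviation is a multiplicative factor, not a number of days.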

28 The fact that the formal test has been performed on the log-transformed survival times does not change the interpretation of the result. If the log-transformed survival times are different for the two groups, then so are the untransformed survival times. Hence, although the conclusion, strictly speaking, should be that there is no significant difference in log survival times, it will often be formulated as "there is no significant difference in survival times".

29 1.6 Example from the biomedical literature Nissen et al. [1], Table 1: Large samples. Similar variability in both groups. p < rather than p = 

30 Kellett, Kellett, and Nordholm [2], Table 2: Relatively small samples. Normality assumption NOT satisfied. Variances NOT equal. No reporting of the p-values.

31 Chapter 2 The comparison of two proportions: Unpaired data
2.1 Example
2.2 The chi-squared test
2.3 Assumptions: The Fisher Exact test
2.4 Rows versus columns
2.5 Example: Case-control data
2.6 Example from the biomedical literature

32 2.1 Example Consider data on sickness absence, collected on 585 employees with a similar job:
Gender    Sickness absence
          No     Yes
female    245    184
male      98     58

33 Research question: Is there a relation between absence and gender? 184/429 = 42.9% of the females and 58/156 = 37.2% of the males have been absent. This suggests that females are more often absent than males. However, even if absence due to sickness is equally frequent amongst males and females, the above results could have occurred by pure chance. It would therefore be of interest to calculate how likely it would be to observe such differences by pure chance.

34 Note that we have again two populations, with a sample randomly drawn from each. Males: the hypothetical population of all male employees with similar job conditions. Females: the hypothetical population of all female employees with similar job conditions. From the first population, a random sample of n1 = 156 males was taken. From the second one, a random sample of n2 = 429 females was drawn. Let π1 and π2 denote the proportions of males and females, respectively, with sickness absence in the total populations. Then π1 and π2 can be estimated by their sample versions π̂1 = 58/156 = 0.372 and π̂2 = 184/429 = 0.429. Because there is no relation between the observations taken from the first population and those taken from the second, we have unpaired data.

35 2.2 The chi-squared test Often, it is of interest to test whether two populations have the same percentage of people with absence due to sickness. This is translated into a set of hypotheses of the form: H0: π1 = π2 versus HA: π1 ≠ π2. We will reject the null hypothesis if the observed data deviate too much from what one would expect to see if the null hypothesis were correct. Hence, we will reject H0 if π̂1 is much larger than π̂2, or vice versa. This is equivalent to rejecting H0 if |π̂1 − π̂2| is too large.

36 Question: How large is too large? Answer: If the observed difference |π̂1 − π̂2| is very unlikely to happen by pure chance. We therefore calculate the probability p of observing a similar experiment with a difference between the groups at least equal to |π̂1 − π̂2| = |0.372 − 0.429| = 0.057, if π1 = π2.

37 In our example, this probability equals p = 0.215. So, even if there is no relation at all between gender and absence, one can still expect to observe a difference of at least 5.7% in 21.5% of future similar experiments. Since p = 0.215 > 0.05 = α, we consider this insufficient evidence to conclude that the occurrence of sickness absence is related to gender.

38 Conclusion: There is no significant difference (p = 0.215) in prevalence of sickness absence between males and females. The testing procedure needed for the comparison of proportions in unpaired data is called the chi-squared test since the calculation of the p-value is based on the chi-squared (χ²) distribution.
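
As a rough check of this p-value, the chi-squared statistic for a 2×2 table can be computed by hand. The cell counts below are derived from the proportions 184/429 and 58/156 given earlier. The sketch assumes the Pearson statistic without continuity correction, which is the variant that appears to reproduce p = 0.215; the function name is illustrative only.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))   # upper tail of chi-squared with 1 df
    return stat, p_value

# Rows: female, male. Columns: no sickness absence, sickness absence.
stat, p_value = chi_squared_2x2(245, 184, 98, 58)
print(f"chi-squared = {stat:.3f}, p = {p_value:.3f}")
```

The erfc expression works only because a 2×2 table has one degree of freedom; larger tables would need the general chi-squared distribution.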

39 2.3 Assumptions: The Fisher Exact test The calculation of the p-value is based on the sampling distribution of Π̂1 − Π̂2, which describes what values for π̂1 − π̂2 can be expected if the experiment were repeated many times. Note that Π̂1 and Π̂2 are the sample averages X̄1 and X̄2 of the binary variable sickness absence. Hence, for large samples, the sampling distribution of Π̂1 − Π̂2 directly follows from the CLT. In small samples, the normality of Π̂1 and Π̂2 can be problematic, and an alternative calculation of the p-value is needed.

40 The Fisher Exact test provides an alternative way to calculate the p-value, without relying on the CLT or on the assumption of large samples. As an example, we consider again data on sickness absence, but from a second, much smaller, company. [Table: sickness absence (No/Yes) by gender in the second company, followed by the chi-squared and Fisher Exact results; values not preserved]

41 We observe considerable differences due to the (extremely) small sample sizes in both groups. In larger samples, chi-squared and Fisher Exact produce much more similar p-values. [Table: for each company, sickness absence among males and females with the χ² and Fisher Exact p-values; entries not preserved]

42 The Fisher Exact test is very time-consuming, and cannot be calculated for large samples, except with special software. However, note that, for large samples, the chi-squared test remains possible, and yields results very similar to the ones that would have been obtained with the Fisher Exact test. In practice, it is often standard to use Fisher Exact, unless computational restrictions require the use of chi-squared. Conclusion: Large samples: chi-squared test. Small samples: Fisher Exact test.
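
To make the exact calculation concrete, here is a minimal sketch of a two-sided Fisher Exact p-value for a 2×2 table. It uses one common convention (sum the probabilities of all tables with the same margins that are at most as likely as the observed one); other two-sided conventions exist. The small table in the example is hypothetical, not the second company's data.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher Exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every table that is at most as likely as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small table: 3 of 4 absent in one group, 1 of 4 in the other.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Fisher Exact p = {p_small:.4f}")
```

The number of tables to enumerate grows with the margins, which is one way to see why the exact test becomes expensive for large samples.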

43 2.4 Rows versus columns When comparing two unpaired proportions, the data can always be summarized by a 2 × 2 table:
Gender    Sickness absence
          No       Yes
female    A        B        A + B
male      C        D        C + D
          A + C    B + D    A + B + C + D
in which A, B, C, and D represent the number of observations in each cell. The hypothesis of interest was to compare the prevalence of sickness absence between males and females.

44 One can show that this is equivalent to comparing the percentage of males (females) between the employees with and without sickness absence:
B/(A + B) = D/(C + D) ⟺ C/(A + C) = D/(B + D)
Proof: B/(A + B) = D/(C + D) ⟺ B(C + D) = D(A + B) ⟺ BC = AD ⟺ C(B + D) = D(A + C) ⟺ C/(A + C) = D/(B + D)
This implies that, for the analysis of a 2 × 2 table, rows and columns can be interchanged. This is of interest for the analysis of case-control data.
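
The interchangeability of rows and columns can also be checked numerically. The sketch below assumes the Pearson chi-squared statistic for a 2×2 table and uses the sickness absence counts derived from 184/429 and 58/156; the second table (2, 4, 3, 6) is a made-up example in which the row proportions are exactly equal.

```python
def pearson_statistic(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 245, 184, 98, 58
by_rows = pearson_statistic(a, b, c, d)
by_columns = pearson_statistic(a, c, b, d)   # same table with rows and columns swapped
print(by_rows, by_columns)

def rows_equal(a, b, c, d):
    return b * (c + d) == d * (a + b)        # B/(A+B) = D/(C+D), cross-multiplied

def columns_equal(a, b, c, d):
    return c * (b + d) == d * (a + c)        # C/(A+C) = D/(B+D), cross-multiplied

# Both conditions reduce to BC = AD, so they always agree.
print(rows_equal(2, 4, 3, 6), columns_equal(2, 4, 3, 6))
print(rows_equal(a, b, c, d), columns_equal(a, b, c, d))
```

The statistic depends on the cells only through (AD − BC)² and the product of the four margins, both of which are unchanged by transposition.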

45 2.5 Case-control data We consider the data on cervical cancer, where the relationship between the occurrence of cervical cancer and the age at first pregnancy is studied. Data were collected on 49 cancer cases and 317 non-cancer cases (controls). All women were asked about their age at first pregnancy. [Table: age at first pregnancy by disease status (cervical cancer, control); counts not preserved]

46 Research question: Is there a relation between cancer and age? Of interest is to compare the prevalence of cancer between women with first pregnancy before the age of 25 and those with first pregnancy later. However, correct estimation of these percentages would have required a sample of women with first pregnancy before the age of 25, and a sample of women with first pregnancy later. This was not the setup of the present experiment, where a number of cases and a number of controls are randomly selected, and where all women are then questioned about their age at first pregnancy.

47 Such a design only allows correct estimation of the percentage of women with first pregnancy before the age of 25, for cases and controls separately. However, since rows and columns can be interchanged, this is sufficient to answer our research question of interest.

48 For testing purposes, rows and columns can be interchanged, implying that the analysis of case-control data still answers the research question of interest. For descriptive purposes, however, the choice between row and column percentages entirely depends on the design of the study. In the above example on cervical cancer, the row percentages (i.e., the percentage of women with first pregnancy before the age of 25), for cancer cases and controls separately, are the only ones that reflect the case-control nature of the experiment.

49 2.6 Example from the biomedical literature Zuskin et al. [3], p.173 and Table 1:

50 It is not clear when chi-squared is used, and when Fisher Exact is used.

51 Chapter 3 The comparison of two means: Paired data
3.1 Example
3.2 Confidence interval for the difference of two means
3.3 The paired t-test
3.4 The paired versus unpaired t-test
3.5 Example
3.6 Assumptions
3.7 Example from the biomedical literature

52 3.1 Example We consider the Captopril example, where blood pressure was taken in 15 hypertensive patients, before and after administration of the drug Captopril.

53 Dataset Captopril [Table: systolic (SBP) and diastolic (DBP) blood pressure, in mm Hg, before and after treatment for each patient, with the average of each of the four measurements; values not preserved]

54 Research question: How does treatment affect BP? As in the unpaired t-test, we might consider this a two-sample case, where a sample is taken from each of two populations. Population 1: patients without treatment. Population 2: patients after treatment with Captopril. Let µ1 be the population average BP if no treatment is given, and let µ2 denote the population average BP after treatment.

55 [Figure: distributions of BP after treatment (mean µ2) and without treatment (mean µ1)] Interest is in inference for the difference µ = µ1 − µ2. The main difference when compared to the unpaired t-test is that each observation from the first sample now uniquely corresponds to one observation from the second sample, and vice versa. Hence, we have paired data.

56 In the case of unpaired data, µ would be estimated by the difference between the two sample averages: µ̂ = µ̂1 − µ̂2 = x̄1 − x̄2. In the case of paired data, µ is estimated by the average of all subject-specific differences between BPs before and after treatment. More specifically, the variable of interest becomes the difference X in BP before and after treatment: X = BP_before − BP_after.

57 The observed values x_i for X can be calculated from the observed values of the BP in our sample. [Table: DBP before, DBP after, and change x_i for each patient; values not preserved] µ is the population mean of the variable X, and inference for µ can be based on the within-subject differences x_i, rather than on the original BP measurements.

58 3.2 Confidence interval for the difference of two means For example, a 95% confidence interval for µ is given by [4.91; 13.63]. Other confidence levels (99%, 90%, ...) are possible as well. The true average effect µ may or may not be in the interval [4.91; 13.63]. However, if 100 similar experiments were conducted, then 95 out of the 100 corresponding C.I.s would be expected to contain µ. Hence, with 95% certainty, we can conclude that we believe µ to be within the interval [4.91; 13.63].

59 This C.I. shows that: the estimate (9.27mmHg) of µ is a very imprecise estimate, since the C.I. is very wide; the estimate is up to 4.36 units precise with 95% chance; based on our data and with 95% certainty, it can be ruled out that µ would be zero, i.e., that there would be no treatment effect at all.

60 3.3 The paired t-test The hypothesis of interest is H0: µ1 = µ2 versus HA: µ1 ≠ µ2. This is equivalent to the following test about the mean of the difference X in blood pressure: H0: µ = 0 versus HA: µ ≠ 0. We will reject the null hypothesis if the observed data deviate too much from what one would expect to see if the null hypothesis were correct. Hence, we will reject H0 if x̄ is much larger or smaller than 0.

61 This is equivalent to rejecting H0 if |x̄ − 0| is too large. Question: How large is too large? Answer: If the observed difference |x̄ − 0| is very unlikely to happen by pure chance. We therefore calculate the probability p of observing a similar experiment with an average observed effect of at least 9.27mmHg, if µ = 0. In our example, this probability equals p = 0.001.

62 So, if there were no treatment effect at all, one could expect to observe a difference of at least 9.27mmHg in only 0.1% of future similar experiments. Since p = 0.001 < 0.05 = α, we consider this sufficient evidence to conclude that Captopril affects the diastolic BP. Conclusion: There is a significant difference (p = 0.001) in diastolic BP before and after treatment with Captopril. The testing procedure is called the paired t-test since paired data are analysed, and since the calculation of the p-value is based on the t-distribution.

63 3.4 The paired versus unpaired t-test What if the Captopril data were analysed using an unpaired t-test?

64 Results from unpaired and paired t-tests, respectively: [Output not preserved] Although both tests lead to a significant result, there is a serious difference in p-values, showing that ignoring the paired nature of the data can lead to wrong conclusions.

65 Conclusion: 15 × 2 measurements ≠ 30 × 1 measurement. In general, the analysis of an outcome measured multiple times per subject (repeated measures) requires different statistical procedures than when the outcome is measured only once for each subject.
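
The effect of ignoring the pairing can be illustrated on a small made-up data set. The values below are hypothetical and are NOT the Captopril measurements; the helper t_two_sided_p, which integrates the t-density numerically, is likewise only an illustrative stand-in for a statistical package. In this constructed example the two analyses even disagree on significance, which is a more extreme outcome than the one reported for the Captopril data.

```python
import math
import statistics

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value of a t statistic, by Simpson integration of the t density."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t_stat)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 1 - 2 * total * h / 3

# Hypothetical before/after measurements on 6 subjects.
before = [100, 110, 120, 130, 140, 150]
after = [96, 107, 115, 127, 134, 146]
n = len(before)

# Paired analysis: one-sample t-test on the within-subject differences.
diffs = [b - a for b, a in zip(before, after)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
p_paired = t_two_sided_p(t_paired, n - 1)

# Unpaired analysis (pooled variance), wrongly treating the samples as independent.
pooled_var = (statistics.variance(before) + statistics.variance(after)) / 2
se_unpaired = math.sqrt(pooled_var * 2 / n)
t_unpaired = (statistics.mean(before) - statistics.mean(after)) / se_unpaired
p_unpaired = t_two_sided_p(t_unpaired, 2 * n - 2)

print(f"paired:   t = {t_paired:.2f}, p = {p_paired:.4f}")
print(f"unpaired: t = {t_unpaired:.2f}, p = {p_unpaired:.4f}")
```

The within-subject differences vary far less than the measurements themselves, so the paired standard error is much smaller than the unpaired one.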

66 3.5 Example Obviously, it is important to correctly account for the paired nature of the data. In practice, this requires knowledge about the design of the study and the way data have been collected. As an example, suppose interest is in testing for differences in BMI between males and females. Suppose that BMI measurements are available for 100 males and 100 females. The unpaired t-test is the obvious choice for the analysis, provided all assumptions are satisfied. Suppose now that the 100 males and females are taken from 100 married couples. Would this change the preferred method for analysis? YES!

67 3.6 Assumptions The calculation of the C.I. as well as the computation of the p-value is based on the sampling distribution of X̄, which describes what values for x̄ can be expected if the experiment were repeated many times. In large samples, this sampling distribution is normal (CLT). In small samples, this normality is only valid in cases where the difference in BP is (approximately) normally distributed. Therefore, in case of small samples, one assumes the difference X to be normally distributed. Note that, in this context, the sample size refers to the number of pairs, not the number of observations in the data set.

68 Conclusion: Large samples: no assumptions. Small samples: normality for the difference X. In our Captopril example, the sample size was small (n = 15). Hence the histogram of the observed differences should be explored for any evidence against symmetry.

69 Histogram of observed differences: [Figure: histogram of the observed differences] Assessment of symmetry is again difficult due to the small sample size, but there is no strong evidence for severe skewness. Note that the normality assumption is with respect to the difference X, not the original measurements.

70 In our example, the original BP measurements (before and after treatment) are allowed to be skewed, as long as their differences are symmetrically distributed. [Figure: skewed distributions after and before treatment (means µ2 and µ1) and a symmetric distribution of the difference X, with the question µ = 0?] Hence, it is useless to check symmetry of the original observations.

71 Note that, in case of skewness, it is often difficult and/or not helpful to transform the observed differences x_i. Since negative differences are often observed, several standard transformations such as ln(·) or √(·) are not possible. Even if a transformation such as, e.g., y_i = ln(x_i + 10) would yield symmetric observations y_i, it is not clear what null hypothesis should be tested. Obviously, one can no longer test whether the mean of Y is equal to zero. In case of skewness, one therefore usually transforms the original data in such a way that the differences become symmetric. This has the advantage that simple, standard transformations can often be used, and that one can still test for mean zero.

72 For example, a potential transformation for the Captopril data would be: BP_before, BP_after → ln(BP_before), ln(BP_after) → X = ln(BP_before) − ln(BP_after), instead of: BP_before, BP_after → X = BP_before − BP_after → Y = ln(X + 5).

73 3.7 Example from the biomedical literature Chen et al. [4], p. 76 and Tables 1 and 2:

74 Paired t-test to test for time trends (IAC versus AOD)

75 Unpaired t-test to test for group differences (SARS versus Control)

76 Chapter 4 The comparison of two proportions: Paired data
4.1 Example
4.2 McNemar test
4.3 Assumptions
4.4 Remark
4.5 McNemar versus chi-squared
4.6 Example from the biomedical literature

77 4.1 Example Consider the data on the prevalence of severe colds in 1319 children, measured at the ages of 12 and 14. The response of interest is whether the child had severe colds during the last 12 months. [Table: severe colds at 12 yrs (Yes/No) by severe colds at 14 yrs (Yes/No); counts not preserved]

78 Research question: Is the prevalence of severe colds different at the two ages? At age 12, 356/1319 = 27% of the children reported severe colds. At age 14, this percentage equals 468/1319 = 35%. These data suggest that the prevalence of severe colds increases with age. It would be of interest to know how likely the observed change in prevalence is to occur by pure chance. If this is very unlikely, the above data provide evidence that the prevalence indeed changes with age. Otherwise, the above data do not provide evidence for such a change.

79 Note that the data structure is similar to the one in the Captopril data, in the sense that subjects are measured twice at different time points. Hence, we have again paired data.

80 4.2 McNemar test Let π1 and π2 be the percentage of children in the total population with a severe cold at the ages 12 and 14, respectively. Interest is in testing whether π1 and π2 are equal, which would reflect no change over time in the percentage of children with a severe cold. The hypothesis of interest is H0: π1 = π2 versus HA: π1 ≠ π2. Note that a change over time in the percentage of severe colds can only occur if children change their status: no severe cold at 12yrs → severe cold at 14yrs, or severe cold at 12yrs → no severe cold at 14yrs.

81 Moreover, in order to have a change over time, more children should change in one direction than in the other. Our test will therefore reject H0 if the number of changers in one direction is much larger than the number of changers in the other direction. In our example, we will reject H0 if the difference between the two numbers of changers is too large. Question: How large is too large? Answer: If the observed difference is very unlikely to happen by pure chance.

82 We therefore calculate the probability p of observing a similar experiment with a difference between the numbers of changers at least equal to 112, if there were no change over time in the total population. In our example, this probability equals p < : This p-value! So, if severe colds occurred equally frequently at both ages, it would be very unlikely to observe what has been observed in this particular experiment. We therefore conclude that our data provide evidence that the probability of having a severe cold at the age of 12 is not the same as the probability of having a severe cold at the age of 14.

83 Conclusion: There is a significant difference (p < ) in the occurrence of severe colds between the ages 12 and 14. The testing procedure needed for the comparison of proportions in paired data is called the McNemar test.

84 4.3 Assumptions Similarly to the chi-squared test, the calculation of the p-value is based on the assumption of a large sample. In case of small samples, the p-value can be calculated without approximations based on the CLT. The exact calculation is similar to the Fisher Exact test for unpaired data. Many statistical packages only support the large-sample calculations.

85 4.4 Remark As discussed before, the McNemar test rejects H0 if the off-diagonal elements are too different from each other, i.e., if there are many more changes in one direction than in the other direction. This implies that the testing procedure is independent of the observed diagonal elements. Examples: [Tables with their McNemar comparisons and p-values; values not preserved]

86 4.5 Mc Nemar versus chi-squared There seems to be a lot of confusion about when Mc Nemar test and when chi-squared test should be used. As an example, consider the results from a survey in which 75 people were questioned about their intended vote in the US presidential elections, before and after a debate on the national television: Before TV debate After TV debate Reagan Carter Reagan Carter PhD Biomedical Sciences: Module 2 83

87 Depending on the research question, this table can be analysed in two different ways:
Chi-squared: test for a relation between the vote before and after the debate
McNemar: test for equal proportions of Reagan voters before and after the debate
Hence, even when data are paired, the chi-squared test can be used.
Note that, in case of continuous data, there is no such choice:
Unpaired data => Unpaired t-test
Paired data => Paired t-test

88 4.5.1 McNemar test
Research question: Is the proportion of Reagan voters the same before and after the debate?

                    After TV debate
Before TV debate    Reagan   Carter   Total
Reagan                  27        7      34
Carter                  13       28      41
Total                   40       35      75

The observed proportions are 34/75 = 45.3% (before) and 40/75 = 53.3% (after).

89 The p-value obtained from the McNemar test equals p = 0.2636.
Hence, a difference at least as large as the observed 45.3% versus 53.3% would occur in 26.36% of the cases, even if the percentage of voters for Reagan is the same before and after the debate.
Conclusion: The debate has not significantly changed the voting behaviour (p = 0.2636).
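As an illustration, the large-sample McNemar calculation can be sketched in a few lines. The two discordant counts used here (7 and 13) are not quoted on the slide; they are derived from the proportions 27/34 and 13/41 given in the chi-squared subsection. The continuity correction is an assumption, chosen because it appears to reproduce the quoted 26.36%.

```python
import math

def mcnemar_p(b, c, continuity=True):
    """Two-sided large-sample McNemar p-value from the two discordant counts."""
    diff = abs(b - c)
    if continuity:
        # Continuity correction (assumption: the slides do not state whether one was used).
        diff = max(diff - 1, 0)
    stat = diff ** 2 / (b + c)
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))

# Derived counts (not quoted on the slide):
reagan_to_carter = 34 - 27   # Reagan before the debate, Carter after
carter_to_reagan = 13        # Carter before the debate, Reagan after

p_value = mcnemar_p(reagan_to_carter, carter_to_reagan)
print(f"McNemar p-value: {p_value:.4f}")
```

Only the 20 people who changed their vote enter the calculation; the 55 who did not change play no role.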

90 4.5.2 Chi-squared test
Research question: Is there a relation between voting behaviour before and after the debate?
Or equivalently: Is the proportion of Reagan voters after the debate the same amongst those who were in favour of Reagan before the debate as amongst those who were in favour of Carter before the debate?

                    After TV debate
Before TV debate    Reagan   Carter   Total
Reagan                  27        7      34
Carter                  13       28      41
Total                   40       35      75

91 The observed proportions are 27/34 = 79.4% and 13/41 = 31.7%.
Note that this comes down to comparing the proportion of Reagan voters after the debate between two separate groups: those who were in favour of Reagan before the debate, and those who were not.
Hence, we now compare unpaired proportions.
The p-value obtained from the chi-squared test equals p < :
The observed difference of 79.4% versus 31.7% would be very unlikely to occur if there were no relation between the voting behaviour before and after the debate.

92 Conclusion: There is a significant relation between the voting behaviour before and after the debate (p < ).
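A minimal sketch of the chi-squared calculation for this 2 × 2 table follows. The cell counts are derived from the proportions 27/34 and 13/41 quoted above, and no continuity correction is applied (an assumption; the slides do not say which variant was used).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Rows: vote before the debate (Reagan, Carter); columns: vote after (Reagan, Carter).
stat, p_value = chi2_2x2(27, 7, 13, 28)
print(f"chi-squared = {stat:.2f}, p = {p_value:.6f}")
```

In contrast to the McNemar test, all four cells of the table contribute to this statistic.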

93 4.5.3 General conclusion
The survey results can be analysed in two different ways, leading to two different conclusions:
McNemar: There is no evidence that a TV debate would change the results of an election (p = 0.2636).
Chi-squared: There is a strong relation between voting behaviour before and after the debate (p < ).
Note that the proportion of Reagan voters before and after a TV debate could also be compared based on unpaired data. One would then question 75 people before the debate, and 75 other people after the debate.

94 The resulting 2 × 2 table would then contain 150 subjects:

             Preference
TV debate    Reagan   Carter   Total
Before           34       41      75
After            40       35      75

The chi-squared test would compare the observed proportions 34/75 = 45.3% and 40/75 = 53.3%, which are the same ones as those compared before with the McNemar test for the experiment with paired observations.

95 4.5.4 Some further examples
There is no relation between (non-)significance of the chi-squared test and (non-)significance of the McNemar test.
Examples: Table: χ²: comparison: vs vs vs vs result: p = p = p < p = McNemar: comparison: vs vs vs vs result: p = p < p = p =

96 4.6 Example from biomedical literature
De Clercq et al. [5], Abstract: McNemar test to compare the presence of symptoms before and after surgery.

97 Chapter 5 Errors in statistics: Basic concepts
Introduction
Two types of errors
Power
Sample size calculation
Examples
Remarks
Example from the biomedical literature

98 5.1 Introduction
Re-consider the example on the weight gain in rats, where interest is in the comparison between rats fed on a high or low protein diet.
Group-specific histograms:

99 Group-specific summary statistics:
On average, there is an observed difference of 19g between the rats on a high protein diet and those on a low protein diet.
Based on the unpaired t-test, we found earlier that this observed difference is not sufficient evidence to believe that the weight gain is really different for the two diets (p = ).

100 Conclusion: There is no significant difference (p = ) in weight gain between rats on a high protein level diet and rats on a low protein level diet.
As indicated before, the result of a statistical test should be interpreted as evidence in favour of or against the null hypothesis, and should not be interpreted as formal proof.
In our example, the difference in weight gain between a population treated with one diet and a population treated with the other diet is too small to be detected based on 12 and 7 animals, respectively.

101 Alternatively, if the t-test had led to p = 0.001, this would still not formally prove that there is a difference between both populations.
After all, p = 0.001 would only indicate that a difference as large as the observed 19g occurs once every 1000 times, even if there is no difference at all between both populations.
Maybe our sample was indeed the extreme one that happens once every thousand experiments.
Hence, whenever statistical tests are used, one has to be aware that errors in the conclusions can occur.
It is therefore important to quantify the errors, and to keep them under control.

102 5.2 Two types of errors

                Reality
Test result     H0 correct      H0 not correct
Accept H0       No error        Type II error
Reject H0       Type I error    No error

Type I error: H0 is incorrectly rejected
Type II error: H0 is incorrectly accepted

103 5.3 Type I error
A type I error occurs if H0 is correct but the test leads to a significant result.
Question: How likely is such an error to occur?
Suppose the test is performed at the α = 5% level of significance.
If H0 is correct, then one will observe a significant result in 5% of the cases.
Hence, in 5% of the cases, H0 would be incorrectly rejected.

104 The probability of making a type I error is therefore equal to the chosen level α of significance.
In practice, the probability of making a type I error is kept under control by choosing α sufficiently small.
In biomedical sciences α = 5% is often used, thereby allowing a type I error in 5% of the cases.

                Reality
Test result     H0 correct      H0 not correct
Accept H0       1 − α
Reject H0       α
Total           1

If H0 is correct, then the probability of making a type I error is α, while the probability of correctly accepting H0 is 1 − α.
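The statement that the type I error rate equals α can be checked with a small simulation. The sketch below is illustrative only: the population (standard normal), the group size of 10, the number of simulated experiments and the use of a z-test with known σ = 1 are all assumptions made for the example, not part of the course material.

```python
import math
import random
from statistics import NormalDist, mean

def significant_under_h0(n, alpha, rng):
    """Draw two groups from the SAME population and run a two-sided z-test."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # sigma = 1 is known in this simulation, so a z-test is appropriate.
    z_stat = (mean(x) - mean(y)) / math.sqrt(2.0 / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return p_value < alpha

rng = random.Random(2024)          # fixed seed so the run is reproducible
n_experiments = 5000
rejections = sum(significant_under_h0(10, 0.05, rng) for _ in range(n_experiments))
type_i_rate = rejections / n_experiments
print(f"Observed type I error rate: {type_i_rate:.3f}")
```

Because H0 is true in every simulated experiment, each rejection is a type I error, and the observed rate should be close to 5%.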

105 5.4 Type II error
A type II error occurs if H0 is incorrect but the test has not detected this, i.e., a non-significant result is obtained.
Question: How likely is such an error to occur?
In contrast to the type I error, the probability of making a type II error is not easily controlled, and depends on various aspects of the sample(s) and population(s).

106 In analogy to the type I error, the type II error rate is denoted by β.

                Reality
Test result     H0 correct      H0 not correct
Accept H0       1 − α           β
Reject H0       α               1 − β
Total           1               1

The power of a statistical test is 1 − β, the probability of correctly rejecting H0.

107 5.5 Power
In general, a specific testing procedure is acceptable only if:
the chance of making a type I error is sufficiently small
the power to detect deviations from H0 is sufficiently large
The first condition can be met by specifying α sufficiently small.
The second condition is more difficult to meet, as the power depends on various aspects of the sample(s) and population(s).
This will be illustrated in the context of the comparison of two groups (such as the weight gain experiment).

108 As before, let µ1 and µ2 represent the average weight gain in the total population, under high and low protein diets, respectively.
The null and alternative hypotheses are given by H0: µ1 = µ2 versus HA: µ1 ≠ µ2.
The power is the probability of correctly rejecting H0. In that case, µ1 ≠ µ2, and we denote the true difference between both populations by Δ = µ1 − µ2.
The unpaired t-test assumes the data to be normally distributed in both populations, with equal variability σ².

109 Graphically: [figure: two normal densities, low protein and high protein, with common variance σ², centred at µ2 and µ1]

110 5.5.1 Power as a function of α
The smaller α, the smaller the power.
Intuitively: Type I errors are less likely if the null hypothesis is rejected less often. However, in cases where H0 is truly wrong, it will also be rejected less often.
An extreme case is obtained for α = 0: α = 0 implies that the null hypothesis is always accepted.
So, in case the null hypothesis is wrong, it is still accepted, leading to power 0.

111 5.5.2 Power as a function of true difference Δ
The smaller Δ, the smaller the power.
Intuitively: Large deviations from the null hypothesis are easier to detect.
[figure: low and high protein densities, shown for a small and for a large distance between µ2 and µ1]

112 5.5.3 Power as a function of variability σ²
The smaller σ², the larger the power.
Intuitively: Homogeneous groups are more easily discriminated than heterogeneous groups.
[figure: low and high protein densities, shown for a large and for a small variance σ²]

113 5.5.4 Power as a function of sample size(s)
The more observations, the larger the power.
Intuitively: More observations yield more information about the population(s), therefore implying more precision in the conclusions.

114 5.5.5 Conclusion
The power depends on various aspects:
Level of significance α
True difference Δ between the populations
Within-group variance σ²
Sample size(s)
Note that the sample size is the only aspect under control of the investigator.
In practice, one can calculate the sample size needed to reach a sufficiently high power.
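The four dependencies listed above can be made concrete with a rough power function. This is a sketch using a normal approximation to the unpaired two-group test (an assumption made to keep the example short; it is not the exact t-test calculation), and the numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(delta, sigma, n1, n2, alpha=0.05):
    """Approximate two-sided power for comparing two means (normal approximation)."""
    z = NormalDist()
    se = sigma * math.sqrt(1.0 / n1 + 1.0 / n2)   # standard error of the difference
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / se
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

base = approx_power(delta=20, sigma=20, n1=15, n2=15)
print(f"baseline       : {base:.3f}")
print(f"smaller alpha  : {approx_power(20, 20, 15, 15, alpha=0.01):.3f}")
print(f"smaller delta  : {approx_power(10, 20, 15, 15):.3f}")
print(f"smaller sigma  : {approx_power(20, 10, 15, 15):.3f}")
print(f"larger samples : {approx_power(20, 20, 40, 40):.3f}")
```

Relative to the baseline, a smaller α or a smaller Δ should lower the power, while a smaller σ or larger samples should raise it.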

115 5.6 Sample size calculation
As indicated before, a testing procedure is only acceptable if it has sufficient power, i.e., if the probability of making a type II error is sufficiently small.
Since the sample size is the only aspect influencing the power that is under the control of the investigator, it is important that experiments are sufficiently large for the power to be sufficiently large as well.
The level α of significance is chosen such that the probability of making a type I error is sufficiently small.
The within-group variance σ² is pre-specified based on earlier, similar experiments, relevant literature, or a pilot study.

116 To be on the safe side, usually an upper bound for σ² is used: if the variability turns out to be smaller, the power will be higher, hence still sufficiently high.
In practice, Δ is not known. Instead, the smallest Δ which would still be clinically relevant to detect is specified.
If sufficient power is attained for the smallest meaningful Δ, we have that:
Any larger difference will be detected with even larger power.
We are not concerned about small powers for detecting smaller differences, as such differences are not relevant anyway.
One can then calculate the number(s) of observations needed to reach a desired level of power.

117 5.7 Example: Weight gain data
In the weight gain data, the observed difference of 19g was found not to be significant (p = ).
We can calculate the power that a real difference of 19g would be found significant if a new experiment were to be conducted, again with 12 and 7 observations in the high and low protein diet groups, respectively.
Group-specific summary statistics, from the current experiment:

118 Power calculations will be based on σ = 21 and α = 0.05.
The power to detect a difference of 19g equals 43.45%.
Hence, with 12 and 7 observations respectively, there is only a 43.45% chance that a true difference of 19g would be detected.
If a difference of 19g is considered clinically relevant, then the weight gain experiment was clearly too small, since it is very likely that such a difference would remain undetected.
We can also calculate the power for other values of Δ.

119 Summary:

True difference Δ    Power to detect the difference
 0g                   5.00%  (equal to α)
10g                  15.70%
19g                  43.45%
30g                  80.80%
40g                  96.49%

For example, 12 and 7 observations would be sufficient to show a true difference of 40g with more than 96% chance.
Alternatively, one can also calculate how large the samples should be to detect a difference of, e.g., 20g with sufficiently high power.
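The slides do not say how these power values were computed. One plausible route, sketched below, is the exact power of the two-sided unpaired t-test, obtained by numerically integrating over the distribution of the pooled variance. The critical value 2.1098 is the standard table value for a t-distribution with 17 degrees of freedom at the two-sided 5% level; the integration grid is an arbitrary choice for this example.

```python
import math
from statistics import NormalDist

T_CRIT_17 = 2.1098  # two-sided 5% critical value, t-distribution with 17 df (table value)

def chi2_pdf(v, df):
    """Density of a chi-squared distribution with df degrees of freedom (v > 0)."""
    log_pdf = (df / 2 - 1) * math.log(v) - v / 2 - (df / 2) * math.log(2) - math.lgamma(df / 2)
    return math.exp(log_pdf)

def t_test_power(delta, sigma, n1, n2, t_crit, steps=4000):
    """Power of the two-sided unpaired t-test, by trapezoidal integration."""
    df = n1 + n2 - 2
    ncp = delta / (sigma * math.sqrt(1.0 / n1 + 1.0 / n2))  # noncentrality parameter
    z = NormalDist()
    upper = df + 12 * math.sqrt(2 * df)   # far into the chi-squared tail
    h = upper / steps
    total = 0.0
    for i in range(1, steps + 1):         # the integrand is 0 at v = 0 when df > 2
        v = i * h
        scaled = t_crit * math.sqrt(v / df)
        reject = (1.0 - z.cdf(scaled - ncp)) + z.cdf(-scaled - ncp)
        weight = 0.5 if i == steps else 1.0
        total += weight * reject * chi2_pdf(v, df)
    return total * h

for delta in (0, 10, 19, 30, 40):
    power = t_test_power(delta, sigma=21, n1=12, n2=7, t_crit=T_CRIT_17)
    print(f"{delta:2d}g: {100 * power:.2f}%")
```

With σ = 21, n1 = 12 and n2 = 7 the printed values should be close to those in the summary table, which suggests, but does not prove, that a calculation of this kind was used.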

121 If a power of 90% is required to detect true effects as small as Δ = 20g, at least 25 observations are needed in each group.
With 30 observations in each group, the probability of making a type II error, when the true effect is not smaller than 20g, is approximately 5%.
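A rough version of this sample size calculation is sketched below. It uses the usual normal-approximation formula for two equal groups plus a small-sample correction term of z²/4 per group (Guenther's approximation). Both the formula and the value σ = 21, taken from the power example above, are assumptions; the slides do not state which method produced the figure of 25.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Approximate per-group sample size for a two-sided unpaired t-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    n_normal = 2.0 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    # Small-sample correction for using t rather than z critical values.
    return math.ceil(n_normal + z_alpha ** 2 / 4.0)

n_needed = n_per_group(delta=20, sigma=21)
print(f"Observations needed per group: {n_needed}")
```

Halving the difference to be detected roughly quadruples the required sample size, since Δ enters the formula squared.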

122 5.8 Example: Sickness absence
We re-consider the data on sickness absence, collected on 585 employees with a similar job:

          Sickness absence
Gender    No     Yes    Total
female    245    184      429
male       98     58      156

The observed difference between the absence rate of 42.9% in females and 37.2% in males was found not to be significant (chi-squared test, p = 0.215).

123 If the percentages of sickness absence were 42% in the total female population and 37% in the total male population, and if a random sample of 429 females and 156 males were taken, there would be a 19.01% chance of obtaining a significant effect.
So, if the population proportions are indeed 42% and 37%, an experiment with 429 and 156 subjects would detect this difference only 19 times out of 100 experiments.
If a difference of 5% is considered clinically relevant, then the current experiment was clearly too small, since it is very likely that such a difference would remain undetected.
We can calculate how large the samples should be in order to detect a difference between 42% and 37% with sufficiently high power.
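The quoted power can be approximated with the standard large-sample formula for comparing two proportions, in which the critical value uses the pooled standard error under H0 and the spread of the test statistic uses the unpooled standard error under the alternative. That this is the formula behind the slide is an assumption, although it appears to reproduce the quoted figure closely.

```python
import math
from statistics import NormalDist

def two_proportion_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate two-sided power of the large-sample test for two proportions."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)                      # pooled proportion
    se_null = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))  # SE if H0 holds
    se_alt = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # SE under the alternative
    diff = abs(p1 - p2)
    upper = z.cdf((diff - z_crit * se_null) / se_alt)
    lower = z.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower

power = two_proportion_power(0.42, 0.37, n1=429, n2=156)
print(f"Power: {100 * power:.2f}%")
```

The small group of 156 males dominates the standard error, which is one reason the power is so low despite 585 subjects in total.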

125 For example, two samples of approximately 2500 observations each are needed in order to show a difference between 37% and 42% with 95% probability.
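A sketch of the corresponding sample size calculation, again under the assumption that the standard large-sample formula for two equal-sized groups was used:

```python
import math
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.95):
    """Approximate per-group sample size to compare two proportions (two-sided)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2.0                       # pooled proportion for equal groups
    term_null = z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    term_alt = z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    return math.ceil((term_null + term_alt) ** 2 / (p1 - p2) ** 2)

n_needed = n_per_group_proportions(0.42, 0.37)
print(f"Observations needed per group: {n_needed}")
```

The result should be in the neighbourhood of the 2500 per group read from the slide.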

126 5.9 Remarks
The earlier examples of power and/or sample size calculations were in the context of the unpaired t-test and chi-squared test.
Similar calculations can be done in any other statistical testing situation, e.g., Fisher Exact test, paired t-test, McNemar test, ...
Strictly speaking, all experiments should be preceded by a realistic sample size calculation to avoid experiments with unacceptably high type II error rates, i.e., with almost no chance at all to show clinically meaningful effects.

127 5.10 Example from the biomedical literature
Wong et al. [6] Methodology section, p.658:

128 Table 2 with results:
Discussion, p.664:

129 The difference on which the sample size calculation was based was much larger than what was actually observed in the experiment.
Therefore, the power to reject equality of the groups was (much) lower than the expected 80%.
The current study cannot tell the difference between a 9% increase and a 3% decrease.
If such differences are considered clinically important, then the current study was under-powered, due to the fact that the difference was overestimated at the time of the sample size calculation.

130 Chapter 6 Errors in statistics: Practical implications
Multiple testing
Bonferroni correction
Tests for baseline differences
Equivalence tests
Significance versus relevance
Examples from biomedical literature