Power Analysis and Sample Size Estimation

Power Analysis implements the techniques of statistical power analysis, sample size estimation, and advanced techniques for confidence interval estimation. The main goal of the first two techniques is to allow you to decide, while designing an experiment, (a) how large a sample is needed to allow statistical judgments that are accurate and reliable, and (b) how likely your statistical test is to detect effects of a given size in a particular situation. The third technique is useful in implementing objectives (a) and (b), and in evaluating the size of experimental effects in practice.

Performing power analysis and sample size estimation is an important aspect of all studies, because without these calculations the sample size may be too large or too small. If the sample size is too small, the study will lack the precision to provide reliable answers to the questions it is investigating. If the sample size is too large, time and resources will be wasted, often for minimal gain.

Power Analysis provides a number of graphical and analytical tools to enable precise evaluation of the factors affecting power and sample size in many of the most commonly encountered statistical analyses. This information can be crucial to the design of a study that is cost-effective and scientifically useful.

At the design stage of an investigation, you should therefore aim to minimize the probability of failing to detect a real effect (type II error, false negative). The probability of type II error is equal to one minus the power of a study (the probability of detecting a true effect). You must select a power level for your study along with the two-sided significance level at which you intend to accept or reject null hypotheses in statistical tests. The significance level you choose (usually 5%) is the probability of type I error (incorrectly rejecting the null hypothesis, false positive).
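The link between power, type II error, and sample size described above can be made concrete with a small sketch. The Python below is illustrative only: it uses the normal approximation for a two-sided comparison of two independent means with equal group sizes. The function names and the example effect size d = 0.5 are assumptions, not values taken from G-Power, and exact t-based calculations give slightly larger samples.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def power_two_means(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test.

    d is the standardized difference between the two means
    (difference divided by the common standard deviation).
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(d) * math.sqrt(n_per_group / 2)
    # Probability that the test statistic falls beyond either critical value.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_two_means(d, power=0.80, alpha=0.05):
    """Smallest n per group reaching the requested power (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    n = n_per_group_two_means(0.5, power=0.80, alpha=0.05)
    achieved = power_two_means(0.5, n)
    print(f"n per group = {n}, power = {achieved:.3f}, beta = {1 - achieved:.3f}")
```

Under these assumptions, `n_per_group_two_means(0.5)` returns 63 per group for 80% power at the 5% two-sided level, and the type II error probability is simply one minus the achieved power.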
For further reading please see Armitage and Berry, 1994; Fleiss, 1981; Gardner and Altman, 1989; Dupont, 1990; Pearson and Hartley.

G-Power estimates the minimum sample sizes necessary to avoid given levels of type II error in the comparison of means, proportions, correlations, regressions, and variances. Remember that good design lies at the heart of good research, and for important studies statistical advice should be sought at the planning stage. You should be familiar with the basic concepts of statistics before you use this software. Please work through some introductory learning materials, such as Bland (2000) or selected web sites.
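As an illustration of a minimum sample size for the comparison of proportions, the sketch below applies a textbook large-sample formula for two independent proportions with equal group sizes (pooled variance under H0, unpooled under H1). It is not G-Power's code. The function name, the one-tailed option, and the example settings are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Approximate n per group for a z test comparing two independent proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))        # pooled, under H0
    sd_alt = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # unpooled, under H1
    n = ((z_alpha * sd_null + z_beta * sd_alt) / (p1 - p2)) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Assumed settings: one-tailed alpha = 0.05, power = 0.95.
    print(n_per_group_two_proportions(0.5, 0.6, alpha=0.05, power=0.95, two_sided=False))
```

With proportions 0.5 and 0.6, an assumed one-tailed alpha of 0.05 and an assumed power of 0.95, the sketch returns 533 per group. That is the same figure listed later in this document for the z test of two independent proportions, although the alpha and power used there are not shown.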

Errors, P values and Power

The P value, or calculated probability, is the probability of obtaining a result at least as extreme as the one observed, calculated on the assumption that the null hypothesis (H0) of a study question is true. The null hypothesis is usually a hypothesis of "no difference", e.g. no difference between blood pressures in group A and group B. Define a null hypothesis for each study question clearly before the start of your study.

The only situation in which you should use a one-sided P value is when a large change in an unexpected direction would have absolutely no relevance to your study. This situation is unusual; if you are in any doubt then use a two-sided P value.

The term significance level (alpha) is used to refer to a pre-chosen probability and the term "P value" is used to indicate a probability that you calculate after a given study.

The alternative hypothesis (H1) is the opposite of the null hypothesis; in plain language this is usually the hypothesis you set out to investigate. For example, the question is "is there a significant (not due to chance) difference in blood pressures between groups A and B if we give group A the test drug and group B a sugar pill?" and the alternative hypothesis is "there is a difference in blood pressures between groups A and B if we give group A the test drug and group B a sugar pill".

If your P value is less than the chosen significance level then you reject the null hypothesis, i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a "meaningful" or "important" difference; that is for you to decide when considering the real-world relevance of your result.

The choice of significance level at which you reject H0 is arbitrary. Conventionally the 5%, 1% and 0.1% levels (P < 0.05, 0.01 and 0.001) have been used; at the 5% level there is a 1 in 20 chance of rejecting a null hypothesis that is in fact true. These numbers can give a false sense of security.
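The comparison of a calculated P value with a pre-chosen significance level can be sketched in a few lines. The example assumes a test statistic that is approximately standard normal under H0 (a z statistic); the function names and the example values of z are hypothetical and only illustrate the decision rule described above.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided P value for a z statistic: probability of a result at
    least as extreme as |z| in either direction, assuming H0 is true."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail


def reject_null(p_value, alpha=0.05):
    """Decision rule: reject H0 when the P value is below the chosen alpha."""
    return p_value < alpha


if __name__ == "__main__":
    for z in (1.0, 1.96, 2.5):
        p = two_sided_p_value(z)
        print(f"z = {z:4.2f}  P = {p:.4f}  reject H0 at 5%: {reject_null(p)}")
```

Note that alpha is fixed before the data are seen, whereas the P value is computed afterwards; the code only compares the two and says nothing about whether the difference is important in practice.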
In the ideal world, we would be able to define a "perfectly" random sample, the most appropriate test and one definitive conclusion. We simply cannot. What we can do is try to optimise all stages of our research to minimise sources of uncertainty.

When presenting P values some groups find it helpful to use the asterisk rating system as well as quoting the P value:

P < 0.05 *
P < 0.01 **
P < 0.001 ***

Most authors refer to statistically significant as P < 0.05 and statistically highly significant as P < 0.001 (less than one in a thousand chance of being wrong). The asterisk system avoids the woolly term "significant". Please note, however, that many statisticians do not like the asterisk rating system when it is used without showing P values. As a rule of thumb, if you can quote an exact P value then do. You might also want to refer to a quoted exact P value as an asterisk in text narrative or tables of contrasts elsewhere in a report.

At this point, a word about error. Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis. As an aide-mémoire: think that our cynical society rejects before it accepts.

The significance level (alpha) is the probability of type I error. The power of a test is one minus the probability of type II error (beta). Power should be maximised when selecting statistical methods. If you want to estimate sample sizes then you must understand all of the terms mentioned here.

The following table shows the relationship between power and error in hypothesis testing:

                   DECISION
TRUTH              Accept H0              Reject H0
H0 is true         correct decision       type I error
                   P = 1-alpha            P = alpha (significance)
H0 is false        type II error          correct decision
                   P = beta               P = 1-beta (power)

H0 = null hypothesis
P = probability

If you are interested in further details of probability and sampling theory at this point then please refer to one of the general texts listed in the reference section.

You must understand confidence intervals if you intend to quote P values in reports and papers. Statistical referees of scientific journals expect authors to quote confidence intervals with greater prominence than P values.

Notes about Type I error:
- is the incorrect rejection of the null hypothesis
- maximum probability is set in advance as alpha
- is not affected by sample size as it is set in advance
- increases with the number of tests or end points (i.e. do 20 tests and 1 is likely to be wrongly significant)

Notes about Type II error:
- is the incorrect acceptance of the null hypothesis
- probability is beta
- beta depends upon sample size and alpha
- can't be estimated except as a function of the true population effect
- beta gets smaller as the sample size gets larger
- beta gets smaller as the number of tests or end points increases
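The note that type I error increases with the number of tests can be checked with simple arithmetic. The sketch below assumes that the tests are independent and that every null hypothesis is true; both assumptions, and the function names, are introduced here for illustration only.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests,
    each carried out at significance level alpha, when all nulls are true."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_positives(alpha, n_tests):
    """Expected number of wrongly significant results under the same assumptions."""
    return alpha * n_tests


if __name__ == "__main__":
    alpha, n_tests = 0.05, 20
    print(f"expected false positives: {expected_false_positives(alpha, n_tests):.1f}")
    print(f"chance of at least one:   {familywise_error_rate(alpha, n_tests):.3f}")
```

For 20 tests at the 5% level the expected number of false positives is 1, and the chance of at least one is about 0.64, which is why the per-test alpha does not describe the error rate of a whole set of end points.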

How to Work with G-Power by an Example


Sample Size Determination in G-Power by Examples:

In this section we list the input quantities needed for each goal of sample size determination, together with the output produced by G-Power. For more details on the statistical and technical issues of the computations see Faul et al. (2007).

1) Exact Tests

Exact - Correlation: Bivariate normal model
Options: exact distribution
Input:
  Correlation ρ H1 = 0.3
  Correlation ρ H0 = 0
Output:
  Lower critical r =
  Upper critical r =
  Total sample size = 115
  Actual power =

Exact - Linear multiple regression: Random model
Options: Exact distribution
Input:
  H1 ρ² = 0.3
  H0 ρ² = 0
  Number of predictors = 3
Output:
  Lower critical R² =
  Upper critical R² =
  Total sample size = 49
  Actual power =

Exact - Proportion: Difference from constant (binomial test, one sample case)
Input:
  Effect size g = 0.3
  Constant proportion = 0.5
Output:
  Lower critical N =
  Upper critical N =
  Total sample size = 28
  Actual power =
  Actual α =

Exact - Proportions: Inequality, two dependent groups (McNemar)
Options: approximation
Input:
  Odds ratio = 1.5
  Prop discordant pairs = 0.3
Output:
  Lower critical N = 148
  Upper critical N = 148
  Total sample size = 894
  Actual power =
  Actual α =
  Proportion p12 =
  Proportion p21 =

Exact - Proportions: Inequality, two independent groups (Fisher's exact test)
Options: Exact distribution
Input:
  Proportion p1 = 0.5
  Proportion p2 = 0.6
  Allocation ratio N2/N1 = 1
Output:
  Sample size group 1 = 557
  Sample size group 2 = 557
  Total sample size = 1114
  Actual power =
  Actual α =

Exact - Proportions: Inequality, two independent groups (unconditional)
Options: z-test pooled
Input:
  Odds ratio =
  Proportion p2 = 0.5
  Allocation ratio N2/N1 = 1
Output:
  Sample size group 1 = 233
  Sample size group 2 = 233
  Total sample size = 466
  Actual α =
  Actual power =

Exact - Proportions: Inequality (offset), two independent groups (unconditional)
Options: proportion, z-test pooled
Input:
  Prop. p1 H1 = 0.65
  Prop. p1 H0 = 0.55
  Proportion p2 = 0.5
  Allocation ratio N2/N1 = 1
Output:
  Sample size group 1 = 524
  Sample size group 2 = 524
  Total sample size = 1048
  Actual α =
  Actual power =

Exact - Proportion: Sign test (binomial test)
Input:
  Effect size g = 0.3
Output:
  Lower critical N =
  Upper critical N =
  Total sample size = 28
  Actual power =
  Actual α =

Exact - Generic binomial test
Input:
  Proportion p2 = 0.8
  Proportion p1 = 0.5
Output:
  Lower critical N =
  Upper critical N =
  Total sample size = 28
  Actual power =
  Actual α =

2) F tests

F tests - ANCOVA: Fixed effects, main effects and interactions
Input:
  Effect size f = 0.25
  Numerator df = 10
  Number of groups = 5
  Number of covariates = 1
Output:
  Noncentrality parameter λ =
  Critical F =
  Denominator df = 394
  Total sample size = 400
  Actual power =

F tests - ANOVA: Fixed effects, omnibus, one-way
Input:
  Effect size f = 0.25
  Number of groups = 5
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df = 4
  Denominator df = 300
  Total sample size = 305
  Actual power =

F tests - ANOVA: Fixed effects, special, main effects and interactions
Input:
  Effect size f = 0.25
  Numerator df = 10
  Number of groups = 5
Output:
  Noncentrality parameter λ =
  Critical F =
  Denominator df = 395
  Total sample size = 400
  Actual power =

F tests - ANOVA: Repeated measures, between factors
Input:
  Effect size f = 0.25
  Number of groups = 2
  Number of measurements = 4
  Corr among rep measures = 0.5
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df = 130
  Total sample size = 132
  Actual power =

F tests - ANOVA: Repeated measures, within factors
Input:
  Effect size f = 0.25
  Number of groups = 2
  Number of measurements = 4
  Corr among rep measures = 0.5
  Nonsphericity correction ε = 1
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df = 102
  Total sample size = 36
  Actual power =

F tests - ANOVA: Repeated measures, within-between interaction
Input:
  Effect size f = 0.25
  Number of groups = 2
  Number of measurements = 4
  Corr among rep measures = 0.5
  Nonsphericity correction ε = 1
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df = 102
  Total sample size = 36
  Actual power =

F tests - Hotellings T²: One group mean vector
Input:
  Effect size Δ = 0.15
  Response variables = 2
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df = 2
  Denominator df = 688
  Total sample size = 690
  Actual power =

F tests - Hotellings T²: Two group mean vectors
Input:
  Effect size Δ = 0.15
  Allocation ratio N2/N1 = 1
  Response variables = 2
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df = 2
  Denominator df = 2747
  Sample size group 1 = 1375
  Sample size group 2 = 1375
  Actual power =

F tests - MANOVA: Global effects
Options: Pillai V, O'Brien-Shieh Algorithm
Input:
  Effect size f²(v) = 0.25
  Number of groups = 3
  Response variables = 2
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df =
  Total sample size = 42
  Actual power =
  Pillai V =

F tests - MANOVA: Special effects and interactions
Options: Pillai V, O'Brien-Shieh Algorithm
Input:
  Effect size f²(v) = 0.25
  Number of groups = 6
  Number of predictors = 3
  Response variables = 2
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df =
  Total sample size = 46
  Actual power =
  Pillai V =

F tests - MANOVA: Repeated measures, between factors
Options: Pillai V, O'Brien-Shieh Algorithm
Input:
  Effect size f = 0.25
  Number of groups = 2
  Number of measurements = 4
  Corr among rep measures = 0
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df =
  Total sample size = 54
  Actual power =
  Pillai V =

F tests - MANOVA: Repeated measures, within factors
Options: Pillai V, O'Brien-Shieh Algorithm
Input:
  Effect size f = 0.25
  Number of groups = 2
  Number of measurements = 4
  Corr among rep measures = 0
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df =
  Total sample size = 74
  Actual power =
  Pillai V =

F tests - MANOVA: Repeated measures, within-between interaction
Options: Pillai V, O'Brien-Shieh Algorithm
Input:
  Effect size f(v) = 0.25
  Number of groups = 2
  Number of measurements = 4
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df =
  Denominator df = 275
  Total sample size = 279
  Actual power =
  Pillai V =

F tests - Linear multiple regression: Fixed model, R² deviation from zero
Input:
  Effect size f² = 0.15
  Number of predictors = 2
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df = 2
  Denominator df = 104
  Total sample size = 107
  Actual power =

F tests - Linear multiple regression: Fixed model, R² increase
Input:
  Effect size f² = 0.15
  Number of tested predictors = 2
  Total number of predictors = 5
Output:
  Noncentrality parameter λ =
  Critical F =
  Numerator df = 2
  Denominator df = 101
  Total sample size = 107
  Actual power =

F tests - Variance: Test of equality (two sample case)
Input:
  Ratio var1/var0 = 1.5
  Allocation ratio N2/N1 = 1
Output:
  Lower critical F =
  Upper critical F =
  Numerator df = 265
  Denominator df = 265
  Sample size group 1 = 266
  Sample size group 2 = 266
  Actual power =

F tests - Generic F test
Analysis: Compromise: Compute α and β
Input:
  Noncentrality parameter λ = 30
  β/α ratio = 1
  Numerator df = 10
  Denominator df = 20
Output:
  Critical F =
  α err prob =
  β err prob =
  Power (1-β err prob) =
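The effect size f used in the ANOVA entries above is Cohen's f, the standard deviation of the group means divided by the common within-group standard deviation. The sketch below shows how f could be obtained from planned group means, and how it relates to a noncentrality parameter through the usual one-way fixed-effects relation λ = f² × N. The function names and example means are hypothetical, equal group sizes are assumed, and the sketch does not compute power, which would need the noncentral F distribution.

```python
from statistics import pstdev


def cohens_f(group_means, within_sd):
    """Cohen's f for equal group sizes: population SD of the group means
    divided by the common within-group SD."""
    return pstdev(group_means) / within_sd


def anova_noncentrality(f, total_n):
    """Noncentrality parameter for a one-way fixed-effects ANOVA (lambda = f^2 * N)."""
    return f ** 2 * total_n


if __name__ == "__main__":
    # Two hypothetical groups half a standard deviation apart give f = 0.25.
    f = cohens_f([0.0, 0.5], within_sd=1.0)
    print(f"f = {f:.2f}, lambda for N = 305: {anova_noncentrality(f, 305):.4f}")
```

For two groups, f is half the standardized mean difference d, so the f = 0.25 used throughout the F-test entries corresponds to d = 0.5 in the two-group case.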

3) t tests

t tests - Correlation: Point biserial model
Input:
  Effect size ρ =
  Power (1-β err prob) =
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 98
  Total sample size = 100
  Actual power =

t tests - Linear bivariate regression: One group, size of slope
Input:
  Slope H1 = 0.15
  α err prob =
  Power (1-β err prob) =
  Slope H0 = 0
  Std dev σ_x = 1
  Std dev σ_y = 1
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 98
  Total sample size = 100
  Actual power =

t tests - Linear bivariate regression: Two groups, difference between intercepts
Input:
  Δ intercept = 2
  Allocation ratio N2/N1 = 1
  Std dev residual σ = 2
  Std dev σ_x1 = 4
  Std dev σ_x2 = 3
  Mean μ_x1 = 5
  Mean μ_x2 = 7
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 194
  Sample size group 1 = 99
  Sample size group 2 = 99
  Total sample size = 198
  Actual power =

t tests - Linear bivariate regression: Two groups, difference between slopes
Input:
  Δ slope =
  α err prob =
  Power (1-β err prob) =
  Allocation ratio N2/N1 = 1
  Std dev residual σ = 0.5
  Std dev σ_x1 = 5
  Std dev σ_x2 = 10
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 98
  Sample size group 1 = 51
  Sample size group 2 = 51
  Total sample size = 102
  Actual power =

t tests - Linear multiple regression: Fixed model, single regression coefficient
Input:
  Effect size f² = 0.15
  α err prob =
  Power (1-β err prob) =
  Number of predictors = 2
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 97
  Total sample size = 100
  Actual power =

t tests - Means: Difference between two dependent means (matched pairs)
Input:
  Effect size dz = 0.5
  α err prob =
  Power (1-β err prob) =
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 100
  Total sample size = 101
  Actual power =

t tests - Means: Difference between two independent means (two groups)
Input:
  Effect size d = 0.5
  α err prob =
  Power (1-β err prob) =
  Allocation ratio N2/N1 = 1
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 100
  Sample size group 1 = 51
  Sample size group 2 = 51
  Total sample size = 102
  Actual power =

t tests - Means: Difference from constant (one sample case)
Input:
  Effect size d = 0.5
  α err prob =
  Power (1-β err prob) =
Output:
  Noncentrality parameter δ =
  Critical t =
  Df = 100
  Total sample size = 101
  Actual power =

t tests - Means: Wilcoxon signed-rank test (matched pairs)
Options: A.R.E. method
Input:
  Parent distribution = Normal
  Effect size dz = 0.5
  α err prob =
  Power (1-β err prob) =
Output:
  Noncentrality parameter δ =
  Critical t =
  Df =
  Total sample size = 101
  Actual power =

t tests - Means: Wilcoxon signed-rank test (one sample case)
Options: A.R.E. method
Input:
  Parent distribution = Normal
  Effect size d = 0.5
  α err prob =
  Power (1-β err prob) =
Output:
  Noncentrality parameter δ =
  Critical t =
  Df =
  Total sample size = 101
  Actual power =

t tests - Means: Wilcoxon-Mann-Whitney test (two groups)
Options: A.R.E. method
Input:
  Parent distribution = Normal
  Effect size d = 0.5
  α err prob =
  Power (1-β err prob) =
  Allocation ratio N2/N1 = 1
Output:
  Noncentrality parameter δ =
  Critical t =
  Df =
  Sample size group 1 = 51
  Sample size group 2 = 51
  Total sample size = 102
  Actual power =

t tests - Generic t test
Analysis: Compromise: Compute α and β
Input:
  Noncentrality parameter δ = 10
  β/α ratio = 1
  Df = 1
Output:
  Critical t =
  α err prob =
  β err prob =
  Power (1-β err prob) =

4) χ² tests

χ² tests - Goodness-of-fit tests: Contingency tables
Input:
  Effect size w = 0.3
  α err prob =
  Power (1-β err prob) =
  Df = 5
Output:
  Noncentrality parameter λ =
  Critical χ² =
  Total sample size = 100
  Actual power =

χ² tests - Variance: Difference from constant (one sample case)
Input:
  Ratio var1/var0 = 1.5
  α err prob =
  Power (1-β err prob) =
Output:
  Lower critical χ² =
  Upper critical χ² =
  Df = 99
  Total sample size = 100
  Actual power =

χ² tests - Generic χ² test
Analysis: Compromise: Compute α and β
Input:
  Noncentrality parameter λ = 20
  β/α ratio = 1
  Df = 5
Output:
  Critical χ² =
  α err prob =
  β err prob =
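The effect size w in the goodness-of-fit entry above is Cohen's w, which measures how far the cell proportions expected under H1 lie from those under H0. The sketch below computes w from two sets of proportions and converts it to a noncentrality parameter with the usual relation λ = w² × N. The function names and the six-cell example (chosen so that Df = 5 and w = 0.3) are hypothetical, and the sketch does not compute power, which would need the noncentral χ² distribution.

```python
import math


def cohens_w(p_null, p_alt):
    """Cohen's w: square root of the sum over cells of (p1 - p0)^2 / p0."""
    if len(p_null) != len(p_alt):
        raise ValueError("both hypotheses must have the same number of cells")
    return math.sqrt(sum((a - n) ** 2 / n for n, a in zip(p_null, p_alt)))


def chi2_noncentrality(w, total_n):
    """Noncentrality parameter for a chi-square test (lambda = w^2 * N)."""
    return w ** 2 * total_n


if __name__ == "__main__":
    # Hypothetical H0: six equally likely cells. H1: three cells gain 0.05 each
    # and three cells lose 0.05 each.
    p0 = [1 / 6] * 6
    p1 = [1 / 6 + 0.05] * 3 + [1 / 6 - 0.05] * 3
    w = cohens_w(p0, p1)
    print(f"w = {w:.3f}, lambda for N = 100: {chi2_noncentrality(w, 100):.2f}")
```

The example shows that a fairly modest shift of 0.05 in each cell probability already corresponds to w = 0.3, the value used in the entry above.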

5) z tests

z tests - Correlation: Tetrachoric model
Options: Exact r
Input:
  H1 corr ρ = 0.1
  α err prob =
  Power (1-β err prob) =
  H0 corr ρ = -0.1
  Marginal prop x = 0.6
  Marginal prop y = 0.3
Output:
  Critical z =
  Total sample size = 100
  Actual power =
  H1 corr ρ =
  H0 corr ρ =
  Critical r lwr =
  Critical r upr =
  Std err r =

z tests - Correlations: Two dependent Pearson r's (common index)
Input:
  H1 corr ρ_ac = -0.1
  α err prob =
  Power (1-β err prob) =
  H0 corr ρ_ab = 0.1
  Corr ρ_bc = -0.1
Output:
  Critical z =
  Sample size = 101
  Actual power =

z tests - Correlations: Two dependent Pearson r's (no common index)
Input:
  H1 corr ρ_cd = 0.15
  α err prob =
  Power (1-β err prob) =
  H0 corr ρ_ab = 0.1
  Corr ρ_ac = 0.7
  Corr ρ_ad = 0.6
  Corr ρ_bc = 0.4
  Corr ρ_bd = 0.41
Output:
  Critical z =
  Sample size = 100
  Actual power =

z tests - Correlations: Two independent Pearson r's
Input:
  Effect size q = 0.6
  α err prob =
  Power (1-β err prob) =
  Allocation ratio N2/N1 = 1
Output:
  Critical z =
  Sample size group 1 = 51
  Sample size group 2 = 51
  Total sample size = 102
  Actual power =

z tests - Logistic regression
Options: Large sample z-test, Demidenko (2007) with var corr
Input:
  Odds ratio = 1.3
  Pr(Y=1|X=1) H0 = 0.2
  α err prob =
  Power (1-β err prob) =
  R² other X = 0
  X distribution = Normal
  X parm μ = 0
  X parm σ = 1
Output:
  Critical z =
  Total sample size = 101
  Actual power =

z tests - Poisson regression
Options: Large sample z-test, Demidenko (2007) with var corr
Input:
  Exp(β1) = 1.3
  Base rate exp(β0) = 0.85
  Mean exposure = 1
  R² other X = 0
  X distribution = Normal
  X parm μ = 0
  X parm σ = 1
Output:
  Critical z =
  Total sample size = 179
  Actual power =

z tests - Proportions: Difference between two independent proportions
Input:
  Proportion p2 = 0.6
  Proportion p1 = 0.5
  Allocation ratio N2/N1 = 1
Output:
  Critical z =
  Sample size group 1 = 533
  Sample size group 2 = 533
  Total sample size = 1066
  Actual power =

z tests - Generic z test
Analysis: Compromise: Compute α and β
Input:
  Noncentrality parameter μ = 3
  Noncentral dist. SD σ = 1
  β/α ratio = 1
Output:
  Critical z =
  α err prob =
  β err prob =
  Power (1-β err prob) =

Sample Size Determination in STATA by Examples:

1) Comparing survival in groups

. stpower logrank

Estimated sample sizes for two-sample comparison of survivor functions
Log-rank test, Freedman method
Ho: S1(t) = S2(t)

Input parameters:
  alpha =  (two sided)
  hratio =
  power =
  p1 =

Estimated number of events and sample sizes:
  E = 72
  N = 72
  N1 = 36
  N2 = 36

2) Cox Regression

. stpower cox, wdprob(.4)

Estimated sample size for Cox PH regression
Wald test, log-hazard metric
Ho: [b1, b2,..., bp] = [0, b2,..., bp]

Input parameters:
  alpha =  (two sided)
  b1 =
  sd =
  power =
  withdrawal(%) = 40.00

Estimated number of events and sample size:
  E = 66
  N = 109

Sample Size Determination in PASS by Examples:

Some web based sample size calculators:
Based on Power: alculators.aspx
Based on CI
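Returning to the Cox regression example above, the reported figures (E = 66, N = 109) can be approximated with a widely used formula for the required number of events, E = (z_{1-α/2} + z_{power})² / (sd² × b1² × (1 − R²)), inflated for withdrawal. The sketch below is not Stata's implementation. The input values did not survive in the listing, so the hazard ratio of 0.5 (b1 = ln 0.5), sd = 0.5, two-sided alpha = 0.05 and power = 0.80 used here are assumptions; only the 40% withdrawal comes from the command shown. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def cox_events_and_sample_size(hazard_ratio, sd, alpha=0.05, power=0.80,
                               r2=0.0, event_prob=1.0, withdrawal=0.0):
    """Approximate events and sample size for a Wald test of one Cox coefficient.

    hazard_ratio : hazard ratio per unit change of the covariate (b1 = ln HR)
    sd           : standard deviation of the covariate of interest
    r2           : squared multiple correlation with the other covariates
    event_prob   : overall probability that a subject has the event
    withdrawal   : proportion of subjects expected to withdraw
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    b1 = math.log(hazard_ratio)
    events = z_sum ** 2 / (sd ** 2 * b1 ** 2 * (1 - r2))
    # Inflate for subjects who never have the event or who withdraw.
    n = events / (event_prob * (1 - withdrawal))
    return math.ceil(events), math.ceil(n)


if __name__ == "__main__":
    # Assumed inputs (see the note above); withdrawal = 0.4 matches wdprob(.4).
    print(cox_events_and_sample_size(0.5, 0.5, withdrawal=0.4))
```

With those assumed inputs the function returns (66, 109), which agrees with the figures in the listing; different inputs would of course give different results.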

References

Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research (4th edition). Oxford: Blackwell Science.
Bland M. An Introduction to Medical Statistics (3rd edition). Oxford Medical Publications.
Casagrande JT, Pike MC, Smith PG. An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics 1978;34:
Donner A, Eliasziw M. A goodness of fit approach to inference procedures for the Kappa statistic: CI construction, significance testing and sample size estimation. Statistics in Medicine 1992;11:
Dupont WD. Power calculations for matched case-control studies. Biometrics 1988;44:
Dupont WD. Power and sample size calculations. Controlled Clinical Trials 1990;11:
Fleiss JL. Statistical Methods for Rates and Proportions (2nd edition). New York: Wiley.
Gardner MJ, Altman DG. Statistics with Confidence - Confidence Intervals and Statistical Guidelines. British Medical Journal.
Pearson & Hartley. Biometrika Tables for Statisticians (Volumes I and II, 3rd edition). Cambridge University Press.
Faul F, Erdfelder E, Lang A-G, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods 2007;39(2):