Catherine Bresee, MS Senior Biostatistician Biostatistics & Bioinformatics Research Institute Statistics in Medicine Research Lecture Series CSMC Fall 2014
Overview
- Review the concept of statistical power
- Factors influencing power
- Examples of sample size calculations (tests of means, tests of proportions)
- Writing a sample size justification
- Software tutorial
Power in Statistical Testing Metaphorically, statistical power is like a magnifying glass: the more powerful the glass, the finer the detail it can reveal. Likewise, a more powerful study is better able to reveal a true difference as a significant result.
Why do a Power Analysis?
- Optimize sample size to economize on costs.
- Determine the sample size needed for a clinically relevant and statistically significant difference.
- Determine the minimally detectable difference for a fixed sample size.
- Determine whether the study is justifiably worth doing.
- Required for NIH grant proposals and IRB/IACUC protocols.
Hypothesis Testing
Hypothesis Testing Statistical testing is typically designed to disprove a null hypothesis. The goal is to show that two groups are different. Proving equality is much harder! (Tests of equivalence require a different approach than standard statistical testing and much larger sample sizes.) The logic is similar to "presumed innocent until proven guilty": we assume treatment groups are not different until they are statistically proven different.
THE TRUTH IN NATURE

What Your Experiment Observes      | True Null Hypothesis (No Difference Between Groups) | True Alternative Hypothesis (Difference Between Groups)
No difference (fail to reject H0)  | Correct outcome! True Negative                      | False Negative: Type II Error (β)
Difference (reject H0)             | False Positive: Type I Error (α)                    | Correct outcome! True Positive
Type I Error Type I Error: rejecting a null hypothesis when in fact it is true (a false positive). Probability of a false positive = α, the significance level. P value: the probability of observing a given set of results as extreme as (or more extreme than) those seen, by chance alone, if the null hypothesis is true. The smaller the p value, the stronger the evidence against the null. If the computed p value < α, reject the null hypothesis.
Type II Error Type II Error: failing to reject a null hypothesis when in fact the null hypothesis is incorrect (a false negative). Probability of a false negative = β. 1 − β = POWER. Power: the probability of a true positive result. Power = 1 − P(Type II Error) = 1 − β. It measures a statistical test's ability to correctly reject a false null hypothesis. A Type I error is often considered more serious than a Type II error.
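The definition above can be made concrete with a small simulation: power is simply the fraction of repeated experiments that correctly reject a false null hypothesis. A minimal sketch using only the Python standard library, with made-up numbers (two groups of n = 20, a true difference of 0.9, known SD = 1, two-sided z test at α = 0.05):

```python
# Monte Carlo demonstration that power = P(reject H0 | HA is true).
# All numbers here are illustrative assumptions, not from the lecture.
import random
from statistics import NormalDist, mean

random.seed(42)
n, true_diff, sd, alpha = 20, 0.9, 1.0, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for two-sided 0.05
se = sd * (2 / n) ** 0.5                      # SE of the difference in means

trials, rejections = 4000, 0
for _ in range(trials):
    g1 = [random.gauss(0.0, sd) for _ in range(n)]        # null group
    g2 = [random.gauss(true_diff, sd) for _ in range(n)]  # shifted group
    z = (mean(g2) - mean(g1)) / se
    if abs(z) > z_crit:
        rejections += 1

power = rejections / trials
print(f"estimated power: {power:.3f}")
```

The estimate lands near the theoretical value of about 0.81 for these settings; shrinking the true difference or the sample size drives the estimate down.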
Null Hypothesis: There is no wolf chasing the sheep.
Type I Error (False Positive): The shepherd cries "Wolf! Wolf!" when there is no wolf.
Type II Error (False Negative): There is a wolf, but the shepherd is too distracted to notice.
Cost Assessment: Help will not arrive the next time the shepherd cries wolf, since no one believes him; one of the sheep might get eaten.
Null Hypothesis: The new drug is not better than the old drug.
Scenario for Type I Error (False Positive): The group treated with the new drug just happens to do better than the group treated with the old drug, but the new drug really does not work that well. Cost: the drug goes on to treat more patients who don't get better and die.
Scenario for Type II Error (False Negative): The group treated with the new drug does not do better than the group treated with the old drug, but the new drug really does work better. Cost: the study stops and the new, better drug is abandoned without further testing.
Motivating Example Mice treated with a known carcinogen develop colon tumors 6 months later. The experiment will test a new transgenic knockout mouse to see if a gene of interest is involved in the pathway for colon cancer. The new transgenic mouse should have more severe colon cancer. How many mice to use per group?
There is no magic number, no perfect sample size! The chosen sample size is a function of many factors.
Six Components to a Statistical Power Analysis Typically, set 5 items as fixed and solve for the 6th.
1. Choose the Type of Statistical Test What is the data (probably) going to look like? What statistical test fits your primary hypothesis?
- Continuous & normally distributed → Student's t test, ANOVA, linear regression, correlation
- Continuous & skewed → Wilcoxon rank sum test, Mann-Whitney U test
- Counts & proportions → Chi-square test, Fisher's exact test
- Survival time → Kaplan-Meier estimate, log-rank test
- Rates → Poisson regression
Parametric tests (e.g., the t test) have more power than non-parametric tests (e.g., the rank sum test). Continuous data gives more power than categorical data, so avoid cut points or dichotomization of continuous variables. Altman, Douglas G., and Patrick Royston. "The cost of dichotomising continuous variables." BMJ 332 (2006): 1080. MacCallum, Robert C., et al. "On the practice of dichotomization of quantitative variables." Psychological Methods 7 (2002): 19.
2. Choose your Significance Level Set your α level. What test level is required? 0.10, 0.05, 0.01, 0.001? One-sided or two-sided? The conventional choice is two-sided α = 0.05. Any value can be selected, but justification is needed; pilot projects sometimes select a higher α.
Two-Sided Test H0: μ1 = μ2 vs. HA: μ1 ≠ μ2 [Figure: two normal curves, mean = 0 (SD = 0.5) and mean = 1 (SD = 0.25)]
One-Sided Test H0: μ1 ≥ μ2 vs. HA: μ1 < μ2 [Figure: two normal curves, mean = 0 (SD = 0.5) and mean = 1 (SD = 0.25)] In general, most research is done with a two-sided test (the more conservative assumption). If a one-sided test is chosen, reviewers typically expect alpha to be cut in half, i.e., set to 0.025.
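The difference between the two choices shows up directly in the critical value of the test statistic. A minimal check using only the Python standard library:

```python
# Critical z values for one- vs two-sided tests at alpha = 0.05.
from statistics import NormalDist

alpha = 0.05
z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)  # alpha split across both tails
z_one_sided = NormalDist().inv_cdf(1 - alpha)      # all of alpha in one tail

print(f"two-sided critical value: {z_two_sided:.3f}")  # 1.960
print(f"one-sided critical value: {z_one_sided:.3f}")  # 1.645
```

The smaller one-sided cutoff is why a one-sided test is easier to "pass", and why the conservative convention is to keep two tails (or halve alpha if one tail is used).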
3. Expected Variation The standard deviation (SD) can be estimated from:
- Previously published research
- The observed SD from a control group in another experiment
- The coefficient of variation, i.e., SD as a percent of the mean
- The estimated range (min, max): SD ≈ (max − min) / C, typically with C = 4 by convention
The range is a poor substitute for an estimated SD; use it only as a last resort.
Variation Normal Distribution Large amounts of variation reduce precision in estimating the difference between group means, so a larger sample size is required when variation is larger. [Figure: example normal curves comparing group means of 2 with SDs of 1, 2, and 3, at n = 6 or n = 10 per group, showing how the overlap between groups grows with the SD.]
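The cost of extra variation can be quantified with the standard normal-approximation sample size formula for two means (introduced later in the lecture). A sketch assuming 80% power, two-sided α = 0.05, and a fixed difference of 2 units:

```python
# How the required n per group grows with the SD, holding the
# detectable difference fixed (illustrative numbers, not the lecture's).
import math
from statistics import NormalDist

z_alpha = NormalDist().inv_cdf(0.975)  # ~1.96
z_beta = NormalDist().inv_cdf(0.80)    # ~0.84

def n_per_group(sd, diff):
    # Normal-approximation formula for two equal-sized groups.
    return math.ceil(2 * sd**2 * (z_alpha + z_beta)**2 / diff**2)

for sd in (1, 2, 3):
    print(f"SD = {sd}: n = {n_per_group(sd, diff=2)} per group")
```

Because n is proportional to the SD squared, doubling the SD quadruples the required sample size.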
Variation Non-normal Distribution
- Categorical data: the amount of variation depends on the sample size and the expected percent in each category.
- Skewed data with the possibility of extreme outliers: you need the range (min, max) and the median (50th percentile).
- Survival data: you need the median survival time and the percent of censored observations (still alive at the end of the study).
4. Minimum Detectable Difference The difference between group means or group proportions. The larger the difference between groups, the smaller the required sample size. Typically, the more groups there are to compare, the larger the required sample size per group.
What is Effect Size? A measurement of the magnitude of the difference between groups relative to the amount of variation.
- For 2 groups, effect size d = (difference in means) / SD.
- For >2 groups, in ANOVA the effect size f is used; by convention, a small effect size is f = 0.1 and a large effect size is f = 0.4.
- For regression models, f² is used.
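The two-group effect size (Cohen's d) is just the difference in means divided by the pooled standard deviation. A sketch with made-up group summaries (not from the lecture):

```python
# Illustrative computation of Cohen's d from group summary statistics.
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean2 - mean1) / math.sqrt(pooled_var)

# Hypothetical groups: a 4-unit mean difference with SD 2.9 in both groups.
d = cohens_d(mean1=5.0, sd1=2.9, n1=10, mean2=9.0, sd2=2.9, n2=10)
print(f"d = {d:.2f}")  # 4 / 2.9 ≈ 1.38
```

Expressing the difference in SD units is what makes effect sizes comparable across outcomes measured on different scales.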
5. Power Power is the probability that an experiment will find a true difference between groups if it exists. Anything greater than 80% is generally acceptable; power is typically fixed at 80% or 90% when computing sample size. Higher power requires a larger sample size.
6. Sample Size What is cost effective and reasonable to work with? Unbalanced sample sizes are acceptable: the treatment group might have more variation, requiring a larger sample size than the control group, or one group might be less available. In these cases, use more careful estimates to calculate the true power.
Factors Affecting Power
Power increases with: parametric tests, a larger magnitude of difference between groups, and a bigger sample size.
Power decreases with: more groups to compare, more variation in the sample, and a smaller p value required for statistical significance.
Writing a Power Analysis Must include all 6 elements: 1. Type of statistical test to be used 2. Significance level (one sided or two sided) 3. Estimated standard deviation 4. Minimum detectable difference 5. Expected Power 6. Sample size per treatment group
Motivating Example Mice treated with a known carcinogen develop colon tumors 6 months later. The experiment will test new transgenic knockout mice to determine if the gene of interest is involved in the cancer pathway. Transgenic mice should get sicker than control mice if this gene is important. PI's question: How many mice to use per group? Is there any published data on control animals?
Example 100% of the mice treated with the carcinogen develop colon tumors. Mean ± SD number of tumors per mouse = 5.0 ± 2.9. Mean ± SD tumor size = 3.5 ± 0.13 mm. The PI would prefer to work with no more than 10 mice per group for cost. The new transgenic mouse should have worse colon cancer if the knocked-out gene is part of the colon cancer pathway.
Null Hypothesis: The transgenic mouse has colon cancer just as bad as the control group.
Scenario for Type I Error (False Positive): The transgenic mice just happen to show worse colon cancer in this experiment, but the new mouse strain really is no different from the regular mouse. Cost: results get published, but are retracted later when future studies show futility. Risk of this scenario = α, set to 5%.
Scenario for Type II Error (False Negative): The transgenic mice do not get any sicker than the control mice in our experiment, but the transgenic mouse really does have a higher rate of colon cancer. Cost: the study stops, no new experiments are planned with the new mouse strain, and other genes are studied instead. Risk of this scenario = β, set to 20%.
Final Power Analysis for NIH Grant For Experiment 2.3: With this treatment (AOM 10 mg/kg i.p., once a week for 4 weeks), mice usually develop multiple colon tumors with 100% penetrance. When mice are sacrificed, we will count the number of tumors and the average tumor size per animal. According to published data, mice develop ~5 (SD = 3) tumors per mouse with an average size of 3.5 mm (SD = 0.13). We will sacrifice these mice 5 weeks after the first injection. We expect that Car1/GH/IRES/GFP+/+ mice will develop more tumors and that the tumors will be bigger. We have therefore chosen 6 animals per group to achieve 80% power to detect an average increase of 5 tumors per mouse and a 0.2 mm increase in average size in a two-sided two-sample t test at the 0.05 significance level.
Example, part 2 Test of a new drug to potentially stop colon cancer using the same mouse model. 100% of mice treated with a known carcinogen develop colon tumors 6 months later. How many mice do we need to test for a decrease in the colon cancer incidence rate?
Power Analysis Results: Based on the assumption that our control group of mice will have an incidence rate of ~100% in developing colon cancer, a future study of the new drug with 10 animals in each group will have 80% power in a two-sided Fisher's exact test at the 0.05 significance level, assuming the treatment group will have an incidence level of 40% or lower.
All NIH grants and IRB and IACUC protocols are reviewed by a biostatistician who knows what a properly written sample size justification should look like.
Some Common Mistakes Common misconception: p < 0.05 means the probability that the null hypothesis is true is less than 0.05. Correct interpretation: a smaller p value indicates that the weight of the evidence against the null hypothesis is stronger, not simply random chance. It does NOT make a result "more statistically significant": differences are either statistically significant or they are not.
Using Pilot Data Inappropriately Pilot data should be an independent collection of data, analyzed in order to plan a future study. It should not be a small sample that failed to demonstrate significant differences and to which more cases will simply be added. If the plan is to increase the sample size of an initial, smaller, failed experiment, that decision should be made a priori, with a plan for interim analyses, alpha spending rules, etc.
More Common Mistakes Common misconception: doubling the sample size cuts the standard deviation in half. (The standard deviation of the data does not shrink with n; it is the standard error of the mean that shrinks, and only by a factor of √2 when n is doubled.) Other common mistakes: choosing one-sided versus two-sided tests inappropriately; non-random sampling; dichotomizing or classifying data extracted from a continuous variable.
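The first misconception is easy to check numerically: the SD is a property of the population and stays put, while the standard error of the mean scales as 1/√n. A sketch with an illustrative SD:

```python
# SD vs. standard error of the mean as n doubles (illustrative SD of 2.9).
import math

sd = 2.9  # the population SD does not change with sample size
for n in (10, 20, 40):
    se = sd / math.sqrt(n)  # SE of the mean shrinks as 1/sqrt(n)
    print(f"n = {n:2d}: SD = {sd}, SE of mean = {se:.3f}")
# Each doubling of n shrinks the SE by sqrt(2) ≈ 1.41, not by 2.
```

This is why halving the uncertainty in an estimate requires quadrupling, not doubling, the sample.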
Cost of Dichotomization
The Error of Post Hoc Power Analysis When a study has negative results, it is inappropriate to calculate how much power the study had. Any statistical test with p > 0.05 will show poor observed power when computed from the data at hand. The implied conclusion of post hoc power is that the observed effect could be real but the sample size was too small; however, the observed mean will vary with each trial. Instead, a confidence interval should be computed for the observed difference between groups and interpreted as such. Hoenig, John M., and Dennis M. Heisey. "The abuse of power: the pervasive fallacy of power calculations for data analysis." The American Statistician 55.1 (2001).
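The recommended alternative, a confidence interval for the observed difference, is straightforward to compute. A normal-approximation sketch with made-up summary statistics (for small samples like these, a t quantile should replace the z value):

```python
# Report a CI for the observed difference instead of post hoc power.
# Group summaries below are hypothetical, not from the lecture.
import math
from statistics import NormalDist

mean1, sd1, n1 = 5.0, 2.9, 10  # hypothetical control group
mean2, sd2, n2 = 6.5, 3.1, 10  # hypothetical treatment group

diff = mean2 - mean1
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # SE of the difference in means
z = NormalDist().inv_cdf(0.975)            # 95% two-sided
lo, hi = diff - z * se, diff + z * se
print(f"difference = {diff:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

A wide interval spanning zero, as here, says "the data are inconclusive about a difference of this size", which is the honest summary a post hoc power number only obscures.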
Simple Formula for the Difference Between Two Means (not appropriate for small sample sizes):

    n = 2 σ² (Zβ + Zα/2)² / (difference)²

where:
- n = sample size in each group (assumes equal-sized groups)
- σ = standard deviation of the outcome variable
- difference = effect size (the difference in means)
- Zβ = value for the desired power (typically 0.84 for 80% power)
- Zα/2 = value for the desired level of statistical significance (typically 1.96)
Simple Formula for the Difference Between Two Proportions (not appropriate for small sample sizes):

    n = 2 p̄ (1 − p̄) (Zβ + Zα/2)² / (p1 − p2)²

where:
- n = sample size in each group (assumes equal-sized groups)
- p̄ = average of the two proportions; p̄(1 − p̄) is a measure of variability (similar to a standard deviation)
- p1 − p2 = effect size (the difference in proportions)
- Zβ = value for the desired power (typically 0.84 for 80% power)
- Zα/2 = value for the desired level of statistical significance (typically 1.96)
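This formula also transcribes directly. Note it is again a normal approximation; an exact calculation based on Fisher's exact test (as quoted in the incidence example earlier) may differ slightly.

```python
# Normal-approximation sample size for comparing two proportions.
import math
from statistics import NormalDist

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2                          # average proportion
    return math.ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta)**2
                     / (p1 - p2)**2)

# Incidence example from the lecture: ~100% in controls vs. 40% on the drug.
print(n_two_proportions(p1=1.0, p2=0.4))  # 10 per group
```

The result matches the 10 animals per group quoted in the power analysis results slide.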
Online Calculators
Calculations are based on formulas for the 1 sample Z test
PS: Power and Sample Size Demo Free PS software (v 3.1.2) available at: http://biostat.mc.vanderbilt.edu/wiki/main/powersamplesize Dupont WD, Plummer WD: "Power and Sample Size Calculations: A Review and Computer Program," Controlled Clinical Trials 1990; 11:116-28.
Step 1: Choose your type of statistical test
Step 2: Choose what to solve for
Step 3: Input parameters:
- 2-group t test
- significance level = 0.05
- standard deviation = 2.9
- difference in group means = 4
- power = 0.80
- m = ratio of samples in each group
Auto generated sample size justification paragraph!!
G*Power Demo Free software (v 3.1.9.2) available at: http://www.gpower.hhu.de/en.html Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.
Excellent G*Power Tutorial: http://www.ats.ucla.edu/stat/gpower/
G*Power Mann-Whitney (rank test): non-parametric test of two group means
G*Power ANOVA: 3 groups, 10 per group, alpha = 0.05, power = 0.80
Minimum detectable effect size = 0.6
Absence of Evidence is not Evidence of Absence
http://www.youtube.com/watch?v=pbodigczql8