Statistics in Medicine Research Lecture Series CSMC Fall 2014




Catherine Bresee, MS, Senior Biostatistician, Biostatistics & Bioinformatics Research Institute. Statistics in Medicine Research Lecture Series, CSMC, Fall 2014

Overview: review the concept of statistical power; factors influencing power; examples of sample size calculations (test of means, test of proportions); writing a sample size justification; software tutorial.

Power in Statistical Testing Metaphorically, statistical power is like a magnifying glass: the more powerful the magnifying glass, the finer the detail it can reveal. Likewise, a more powerful study is better able to reveal a significant result.

Why do a Power Analysis? To optimize sample size and economize on costs. To determine the sample size for a clinically relevant and statistically significant difference. To determine the minimum detectable difference for a fixed sample size. To determine if the study is justifiably worth doing. Required for NIH grant proposals and IRB / IACUC protocols.

Hypothesis Testing

Hypothesis Testing Statistical testing is typically framed to disprove a null hypothesis. Goal: proving two groups are different. It is much more challenging to prove equality! (Tests of equivalence require a different approach than typical statistical testing and a much larger sample size.) Similar to "presumed innocent until proven guilty": we assume treatment groups are not different until statistically proven different.

THE TRUTH IN NATURE vs. What Your Experiment Observes:

                                      True Null Hypothesis             True Alternative Hypothesis
                                      (No Difference Between Groups)   (Difference Between Groups)
No Difference Observed                Correct Outcome!                 False Negative
(fail to reject Ho)                   True Negative                    Type II Error (β)
Difference Observed                   False Positive                   Correct Outcome!
(reject Ho)                           Type I Error (α)                 True Positive

Type I Error Type I Error: rejecting a null hypothesis when in fact it is true (False Positive). Probability of a false positive = α. α = significance level. P value: the probability of observing results as extreme as (or more extreme than) those obtained, by chance alone, if the null hypothesis is true. The smaller the p value, the stronger the evidence against the null. If the computed p value < α, then reject the null hypothesis.

Type II Error Type II Error: failing to reject a null hypothesis when in fact the null hypothesis is false (False Negative). Probability of a false negative = β. 1 − β = POWER. Power: the probability of a true positive result. Power = 1 − P(Type II Error) = 1 − β. A measure of a statistical test's ability to correctly reject a false null hypothesis. A Type I Error is often considered more serious than a Type II Error.

Null Hypothesis: There is no wolf chasing the sheep. Type I Error (False Positive): the shepherd cries out "Wolf! Wolf!" when there is no wolf. Cost: help will not arrive the next time the shepherd cries wolf, since no one believes him. Type II Error (False Negative): there is a wolf, but the shepherd is too distracted to notice. Cost: one of the sheep might get eaten.

Null Hypothesis: The new drug is not better than the old drug. Scenario for Type I Error (False Positive): the group treated with the new drug just happens to do better than the group treated with the old drug, but the new drug really does not work that well. Cost: the drug goes on to treat more patients who do not get better and die. Scenario for Type II Error (False Negative): the group treated with the new drug does not do better than the group treated with the old drug, but the new drug really does work better than the old drug. Cost: the study stops and the new, better drug is abandoned without further testing.

Motivating Example Mice treated with a known carcinogen develop colon tumors 6 months later. The experiment will test a new transgenic knockout mouse to see if the gene of interest is involved in the pathway for colon cancer. The new transgenic mouse should have more severe colon cancer. How many mice to use per group?

There is no magic number, no perfect sample size! The chosen sample size is a function of many factors.

Six Components to a Statistical Power Analysis Typically set 5 items as fixed and solve for the 6th.

1. Choose the Type of Statistical Test What is the data (probably) going to look like? What statistical test for your primary hypothesis? Continuous & normally distributed → Student's t test, ANOVA, linear regression, correlation. Continuous & skewed → Wilcoxon rank-sum test, Mann-Whitney U test. Counts & proportions → Chi-square test, Fisher's exact test. Survival time → Kaplan-Meier test, log-rank test. Rates → Poisson regression.

Parametric tests (e.g., t test) have more power than non-parametric tests (e.g., rank-sum test). Continuous data have more power than categorical data. Avoid cut points or dichotomization of continuous data. Altman, Douglas G., and Patrick Royston. "The cost of dichotomising continuous variables." BMJ 332 (2006): 1080. MacCallum, Robert C., et al. "On the practice of dichotomization of quantitative variables." Psychological Methods 7 (2002): 19.

2. Choose your Significance Level Set your α level. What test level is required? 0.10, 0.05, 0.01, 0.001? One-sided or two-sided? The conventional choice is two-sided α = 0.05. Any value can be selected, but justification is needed. Pilot projects sometimes select a higher α.
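For intuition, the critical z-values behind these choices can be checked with a few lines of Python (a minimal sketch using scipy; the α values shown are just the conventional ones listed above):

from scipy.stats import norm

alpha = 0.05
print(norm.ppf(1 - alpha / 2))  # two-sided critical z at alpha = 0.05 -> 1.96
print(norm.ppf(1 - alpha))      # one-sided critical z at alpha = 0.05 -> 1.64
print(norm.ppf(1 - 0.025))      # one-sided test run at alpha = 0.025 -> 1.96 again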

Two-Sided Test: H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2 [Figure: two normal curves, mean = 0, SD = 0.5 and mean = 1, SD = 0.25]

One-Sided Test: H0: μ1 = μ2 vs. Ha: μ1 < μ2 [Figure: two normal curves, mean = 0, SD = 0.5 and mean = 1, SD = 0.25] In general, most research is done with a two-sided test (the more conservative assumption). If a one-sided test is chosen, then cut alpha in half and set it to 0.025.

3. Expected Variation The standard deviation (SD) can be estimated from: previously published research; the observed SD from a control group in another experiment; the coefficient of variation, i.e., SD as a percent of the mean; or an estimated range (min, max), using SD ≈ (max − min)/C with typically C = 4, based on convention. The range is a poor substitute for an estimated SD; use it only as a last resort.

Variation: Normal Distribution Large amounts of variation can affect precision in estimating the difference between group means. A larger sample size will be required when variation is larger. [Figure: example comparisons of two groups (mean ± SD) showing that larger SDs require larger n per group to detect the same difference]

Variation: Non-normal Distribution For categorical data, the amount of variation depends on the sample size and the expected percentage in each category. For skewed data with the possibility of extreme outliers, the range (min, max) and median (50th percentile) are needed. For survival data, the median survival time and the percentage of censored observations (still alive at the end of the study) are needed.

4. Minimum Detectable Difference The difference between group means or group proportions. The larger the difference between groups, the smaller the required sample size. Typically, the more groups to compare, the larger the required sample size per group.

What is Effect Size? A measurement of the magnitude of the difference between groups relative to the amount of variation. For 2 groups, effect size d = (difference in means) / SD. For more than 2 groups (ANOVA), effect size f: a small effect size = 0.1, a large effect size = 0.4. For regression models, R².
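As an illustration of the two-group effect size, a short Python sketch (the sample values below are made up purely for illustration):

import numpy as np

group1 = np.array([4.8, 5.1, 6.0, 3.9, 5.5, 4.7])   # hypothetical control values
group2 = np.array([7.2, 8.1, 6.9, 7.8, 8.4, 7.0])   # hypothetical treatment values

# Pooled standard deviation across the two groups
n1, n2 = len(group1), len(group2)
sd_pooled = np.sqrt(((n1 - 1) * group1.var(ddof=1) +
                     (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2))

# Cohen's d: difference in means divided by the pooled SD
d = (group2.mean() - group1.mean()) / sd_pooled
print(round(d, 2))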

5. Power Power is the probability that an experiment will find a true difference between groups if it exists. Anything greater than 80% is generally acceptable. Power is typically set at 80% or 90% when a fixed value is used in computing sample size. Higher power requires a larger sample size.

6. Sample Size What is cost-effective and reasonable to work with? An unbalanced sample size is acceptable: the treatment group might have more variation, requiring a larger sample size than the control group, or one group might not be as available. More careful estimates should then be used to calculate the true power.

Factors Affecting Power: parametric tests; multiple groups to compare; increased magnitude of difference between groups; increased variation in the sample; bigger sample size; smaller p value required for statistical significance.

Writing a Power Analysis Must include all 6 elements: 1. Type of statistical test to be used 2. Significance level (one sided or two sided) 3. Estimated standard deviation 4. Minimum detectable difference 5. Expected Power 6. Sample size per treatment group

Motivating Example Mice treated with a known carcinogen develop colon tumors 6 months later. The experiment will test new transgenic knockout mice to determine if the gene of interest is involved in the cancer pathway. Transgenic mice should get sicker than control mice if this gene is important. PI's question: How many mice to use per group? Is there any published data on control animals?

Example 100% of the mice treated with the carcinogen develop colon tumors. Mean ± SD number of tumors per mouse = 5.0 ± 2.9. Mean ± SD tumor size per tumor = 3.5 ± 0.13 mm. The PI would prefer to work with no more than 10 mice per group for cost. The new transgenic mouse to be studied should have worse colon cancer if the knocked-out gene is part of the colon cancer pathway.


Null Hypothesis: The transgenic mouse has colon cancer just as bad as the control group. Scenario for Type I Error (False Positive): the transgenic mouse just happens to show worse colon cancer in this experiment, but the new mouse strain really is not any different from the regular mouse. Cost: results get published, but are retracted later when future studies show futility. Risk of this scenario = α, set to 5%. Scenario for Type II Error (False Negative): the transgenic mice do not get any sicker than the control mice in our experiment, but the transgenic mouse really does have a higher rate of colon cancer. Cost: the study stops and no new experiments are planned with the new mouse breed; other genes are studied instead. Risk of this scenario = β, set to 20%.

Final Power Analysis for NIH Grant For Experiment 2.3: With this treatment (AOM 10 mg/kg i.p., once a week for 4 weeks), mice usually develop multiple colon tumors with 100% penetrance. When mice are sacrificed we will count the number of tumors and the average tumor size per animal. According to published data, mice develop ~5 (SD = 3) tumors/mouse with an average size of 3.5 mm (SD = 0.13) [32]. We will sacrifice these mice 5 weeks after the first injection. We expect that Car1/GH/IRES/GFP+/+ mice will develop more tumors and that the tumors will be bigger. We have therefore chosen 6 animals per group to achieve 80% power to detect an average increase of 5 tumors per mouse and a 0.2 mm increase in average size in a two-sided two-sample t test at the 0.05 significance level.
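As a rough check, the 6 animals per group for the tumor-count comparison can be reproduced with the simple z-approximation formula shown later in this lecture (a sketch only; the grant's actual calculation may have been done in PS or a similar program, and exact t-based methods give slightly different answers):

from math import ceil
from scipy.stats import norm

sd = 3.0          # published SD of tumor count per mouse
difference = 5.0  # detectable increase of 5 tumors per mouse
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided test
z_beta = norm.ppf(power)            # 0.84 for 80% power

n_per_group = 2 * sd**2 * (z_alpha + z_beta)**2 / difference**2
print(ceil(n_per_group))            # -> 6 mice per group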

Example, part 2 Test of a new drug to potentially stop colon cancer using the same mouse model. 100% of mice treated with a known carcinogen develop colon tumors 6 months later. How many mice do we need to detect a decrease in the colon cancer incidence rate?

Power Analysis Results: Based on the assumption that our control group of mice will have an incidence rate of ~100% in developing colon cancer, a future study of the new drug with 10 animals in each group will have 80% power in a two-sided Fisher's Exact Test at the 0.05 significance level, assuming the treatment group will have an incidence rate of 40% or lower.
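A simulation-based cross-check of this statement is sketched below (assumptions: control incidence fixed at exactly 100%, treatment incidence 40%, 10 animals per group; the estimated power will vary slightly from run to run):

import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(42)
n_per_group, p_control, p_treated = 10, 1.0, 0.4
n_sims, alpha = 5000, 0.05

hits = 0
for _ in range(n_sims):
    tumors_control = rng.binomial(n_per_group, p_control)
    tumors_treated = rng.binomial(n_per_group, p_treated)
    table = [[tumors_control, n_per_group - tumors_control],
             [tumors_treated, n_per_group - tumors_treated]]
    _, p_value = fisher_exact(table, alternative='two-sided')
    hits += p_value < alpha

print(hits / n_sims)   # estimated power, roughly 0.8 or higher under these assumptions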

All NIH grants and IRB & IACUC protocols are reviewed by a biostatistician who knows what a properly written sample size justification should look like.

Some Common Mistakes Common misconception: p < 0.05 means the probability that the null hypothesis is true is less than 0.05. Correct interpretation: a smaller p value indicates stronger evidence against the null hypothesis, i.e., that the result is less likely to be due to chance alone. A smaller p value is NOT "more statistically significant": differences are either statistically significant or they are not.

Using Pilot Data Inappropriately Pilot data should be an independent collection of data used to plan a future study. It should not be a small sample that failed to demonstrate significant differences and to which more cases will simply be added. If the plan is to increase the sample size of an initial, smaller "failed" experiment, that decision should be made a priori, with a plan for interim analysis, alpha-spending rules, etc.

More Common Mistakes The misconception that doubling the sample size cuts the standard deviation in half; one-sided versus two-sided tests; non-random sampling; dichotomizing or classifying data extracted from a continuous variable.

Cost of Dichotomization
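The power lost to dichotomization can be illustrated with a small simulation (an assumed scenario, not from the lecture: two groups of 20, a true mean difference of 0.8 SD, the continuous outcome analyzed by t test versus the same data split at the pooled median and analyzed by Fisher's exact test):

import numpy as np
from scipy.stats import ttest_ind, fisher_exact

rng = np.random.default_rng(1)
n, diff, n_sims, alpha = 20, 0.8, 2000, 0.05
power_t = power_dichotomized = 0

for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(diff, 1.0, n)
    power_t += ttest_ind(a, b).pvalue < alpha

    cut = np.median(np.concatenate([a, b]))        # dichotomize at the pooled median
    table = [[np.sum(a > cut), np.sum(a <= cut)],
             [np.sum(b > cut), np.sum(b <= cut)]]
    power_dichotomized += fisher_exact(table)[1] < alpha

print(power_t / n_sims, power_dichotomized / n_sims)   # the t test retains noticeably more power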

The Error of Post Hoc Power Analysis When a study has negative results, it is inappropriate to calculate how much power the study had. Any statistical test with p > 0.05 will show poor observed power when computed from the data at hand. The implied conclusion of post hoc power is that the observed effect could be real but the sample size was too small; however, the observed mean will vary with each trial. Instead, a confidence interval should be computed for the observed difference between groups and interpreted as such. Hoenig, John M., and Dennis M. Heisey. "The abuse of power." The American Statistician 55.1 (2001).
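A minimal sketch of the recommended alternative, a 95% confidence interval for the difference of two means from summary statistics (the numbers plugged in here are placeholders, not from any study):

import numpy as np
from scipy.stats import t

# Placeholder summary statistics from a hypothetical "negative" study
mean1, sd1, n1 = 5.0, 2.9, 10
mean2, sd2, n2 = 6.4, 3.1, 10

diff = mean2 - mean1
sp = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))   # pooled SD
se = sp * np.sqrt(1 / n1 + 1 / n2)
t_crit = t.ppf(0.975, n1 + n2 - 2)

# Report the 95% CI for the observed difference instead of post hoc power
print(diff - t_crit * se, diff + t_crit * se)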

Simple Formula for the Difference Between Two Means (not appropriate for small sample sizes)

n = 2σ²(Zβ + Zα/2)² / (difference)²

where n = sample size in each group (assumes equal-sized groups); σ = standard deviation of the outcome variable; Zβ = value for the desired power (typically 0.84 for 80% power); Zα/2 = value for the desired level of statistical significance (typically 1.96); difference = effect size (the difference in means).
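The formula translates directly into code; a sketch using scipy for the normal quantiles (the example call re-uses the mouse-example SD and difference purely for illustration):

from math import ceil
from scipy.stats import norm

def n_per_group_means(sd, difference, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two means (equal groups, z-approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # typically 1.96
    z_beta = norm.ppf(power)            # typically 0.84 for 80% power
    return ceil(2 * sd**2 * (z_beta + z_alpha)**2 / difference**2)

# z-approximation; exact t-based software will usually give a slightly larger n
print(n_per_group_means(sd=2.9, difference=4))   # -> 9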

Simple Formula for the Difference Between Two Proportions (not appropriate for small sample sizes)

n = 2·p̄(1 − p̄)(Zβ + Zα/2)² / (p1 − p2)²

where n = sample size in each group (assumes equal-sized groups); p̄ = (p1 + p2)/2, a measure of variability (similar to the standard deviation); Zβ = value for the desired power (typically 0.84 for 80% power); Zα/2 = value for the desired level of statistical significance (typically 1.96); p1 − p2 = effect size (the difference in proportions).
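The same formula in code, a sketch (the example call uses the ~100% vs 40% incidence from the mouse drug example):

from math import ceil
from scipy.stats import norm

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions (equal groups, z-approximation)."""
    p_bar = (p1 + p2) / 2               # average proportion; p_bar*(1 - p_bar) plays the role of the variance
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * p_bar * (1 - p_bar) * (z_beta + z_alpha)**2 / (p1 - p2)**2)

print(n_per_group_proportions(1.0, 0.40))   # -> about 10 per group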

Online Calculators

Calculations are based on formulas for the 1 sample Z test


Online Calculators

PS: Power and Sample Size Demo Free PS software (v 3.1.2) available at: http://biostat.mc.vanderbilt.edu/wiki/main/powersamplesize Dupont WD, Plummer WD: 'Power and Sample Size Calculations: A Review and Computer Program', Controlled Clinical Trials 1990; 11:116-28.

Step 1: Choose your type of statistical test

Step 2: Choose what to solve for Step 3: Input parameters: 2-group t test; significance level = 0.05; standard deviation = 2.9; difference in group means = 4; power = 0.80; m = ratio of samples in each group
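The same inputs can be cross-checked in Python with statsmodels (a sketch; PS uses its own t-based formulas, so the two programs may differ slightly in the decimals):

from statsmodels.stats.power import TTestIndPower

sd, difference = 2.9, 4.0
effect_size = difference / sd     # standardized difference (Cohen's d), about 1.38

n_per_group = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                          power=0.80, ratio=1.0,
                                          alternative='two-sided')
print(n_per_group)   # just under 10, i.e. 10 animals per group after rounding up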

Auto-generated sample size justification paragraph!

G*Power Demo Free software (v 3.1.9.2) available at: http://www.gpower.hhu.de/en.html Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175-191.

Excellent G*Power Tutorial: http://www.ats.ucla.edu/stat/gpower/

G*Power Mann-Whitney (rank test): non-parametric test of two group means


G*Power ANOVA: 3 groups, 10 per group, alpha = 0.05, power = 0.80

Minimum detectable effect size = 0.6
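A rough cross-check of the G*Power ANOVA example with statsmodels (a sketch; effect_size here is Cohen's f, and nobs is the total sample size across all 3 groups):

from statsmodels.stats.power import FTestAnovaPower

# Solve for the minimum detectable Cohen's f with 3 groups of 10 (nobs = total N = 30)
f = FTestAnovaPower().solve_power(effect_size=None, nobs=30, alpha=0.05,
                                  power=0.80, k_groups=3)
print(f)   # should come out close to the 0.6 reported by G*Power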

Absence of Evidence is not Evidence of Absence

http://www.youtube.com/watch?v=pbodigczql8