Erik Parner, 14 September 2016. Basic Biostatistics - Day 2 - 21 September 2016


PhD course in Basic Biostatistics - Day 2
Erik Parner, Department of Biostatistics, Aarhus University

Log-transformation of continuous data
    Exercise.+.4+Standard- (Triglyceride)
    Logarithms and exponentials
Two independent samples from normal distributions
    The model, check of the model, estimation
    Comparing the two means
    Approximate confidence interval and test
    Exact confidence interval and test using the t-distribution
Comparing two populations using a non-parametric test
    The Wilcoxon-Mann-Whitney test
Two independent samples from normal distributions
    Type 1 and type 2 errors
    Statistical power
    Sample size calculations

Overview

Data to analyse   Type of analysis                Unpaired/Paired   Type              Day
------------------------------------------------------------------------------------------
Continuous        One sample mean                 Irrelevant        Parametric        Day 1
                                                                    Nonparametric     Day 3
                  Two sample mean                 Non-paired        Parametric        Day 2
                                                                    Nonparametric     Day 2
                                                  Paired            Parametric        Day 3
                                                                    Nonparametric     Day 3
                  Regression                      Non-paired        Parametric        Day 5
                  Several means                   Non-paired        Parametric        Day 6
                                                                    Nonparametric     Day 6
Binary            One sample mean                 Irrelevant        Parametric        Day 4
                  Two sample mean                 Non-paired        Parametric        Day 4
                                                  Paired            Parametric        Day 4
                  Regression                      Non-paired        Parametric        Day 7
Time to event     One sample: Cumulative risk     Irrelevant        Nonparametric     Day 8
                  Regression: Rate/hazard ratio   Non-paired        Semi-parametric   Day 8

Log-transformation of continuous data

Continuous data with a long tail to the right are often log-transformed to obtain an approximate normal distribution.

Recall the triglyceride measurements. Applying a normal based prediction interval (PI) on the original data gives invalid results: e.g. the PI will not have 2.5% of the data below the lower limit and 2.5% above the upper limit.

The logarithm of the triglyceride measurements follows (approximately) a normal distribution:

[Figure: histogram and Q-Q plot (inverse normal) of log-triglyceride, and histogram of triglyceride on the original scale.]

We then need to transform the results back to the original scale to obtain useful results on the triglyceride measurements. The method presented here relies on the fact that percentiles are preserved when the data are transformed.
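The claim that percentiles survive a transformation can be checked on a small example. The following is a minimal Python sketch, not part of the course material; the nine values in `y` are hypothetical right-skewed measurements, not the triglyceride data.

```python
import math
import statistics

# Hypothetical right-skewed measurements (NOT the course's triglyceride data).
y = [0.21, 0.30, 0.35, 0.41, 0.46, 0.52, 0.63, 0.80, 1.95]

# Log-transform every observation.
x = [math.log(v) for v in y]

# The logarithm is increasing, so the middle observation stays the middle one:
# transforming the median of x back with exp gives the median of y.
median_y = statistics.median(y)
back_transformed = math.exp(statistics.median(x))

# The mean does not share this property: exp(mean of logs) is the
# geometric mean, which is smaller than the arithmetic mean of y.
mean_y = statistics.mean(y)
exp_mean_x = math.exp(statistics.mean(x))

print(median_y, back_transformed)
print(mean_y, exp_mean_x)
```

The same argument holds for any percentile, which is why a prediction interval computed on the log scale can be moved back to the original scale by applying exp to its two limits.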

Logarithmic and exponential functions

Both the logarithm and the exponential function are increasing functions.

[Figure: graphs of the logarithm and of the exponential function.]

Logarithmic and exponential transformations

Medians and percentiles are preserved when making a transformation of the data:

[Figure: a distribution before and after the exp and log transformations; the same percentage of the data lies to the right of a cut-point and of the transformed cut-point.]

Thus (the last equivalence requires positive $X$ and $A$)

$$\exp(X) < \exp(A) \iff X < A \iff \log(X) < \log(A)$$

Prediction intervals are given by the 2.5 and 97.5 percentiles.
For a normal distribution the mean is equal to the median, i.e. the 50 percentile.

Transforming the results

[Figure: density of log-triglyceride with the PI and the CI for the mean, mapped by exp to the density of triglyceride with the PI and the CI for the median, 0.46 (0.44; 0.48).]

Summary

Let $Y$ denote the original observation. If $X = \log(Y)$ has a normal distribution with mean = median = $\mu$ and standard deviation = $\sigma$, then

    a valid 95% CI for $\mu$ will transform into a valid 95% CI for the median of $Y = \exp(X)$
    a valid 95% PI for $X$ will transform into a valid 95% PI for $Y = \exp(X)$

The relations between the median and the mean are

$$\mathrm{median}(Y) = \exp(\mu)$$

$$\mathrm{mean}(Y) = \exp(\mu + 0.5\,\sigma^2)$$
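The back-transformation in the summary can be sketched in a few lines of Python. This is an illustration only: the function name `back_transform` and the log-scale inputs (mean -0.77, CI limits -0.81 and -0.74, standard deviation 0.40) are hypothetical values chosen for the example, not results taken from the course data.

```python
import math

def back_transform(m, s, ci_low, ci_high):
    """Summaries of Y = exp(X) when X = log(Y) is normal with mean m and sd s.

    ci_low and ci_high are the limits of a 95% CI for m on the log scale.
    """
    return {
        # exp is increasing, so median and interval limits are transformed directly
        "median": math.exp(m),
        "median_ci": (math.exp(ci_low), math.exp(ci_high)),
        "pi": (math.exp(m - 1.96 * s), math.exp(m + 1.96 * s)),
        # the mean of Y is NOT exp(m); it also depends on the variance
        "mean": math.exp(m + 0.5 * s ** 2),
    }

# Hypothetical log-scale results, for illustration only.
res = back_transform(m=-0.77, s=0.40, ci_low=-0.81, ci_high=-0.74)
for name, value in res.items():
    print(name, value)
```

Note that the back-transformed interval is a CI for the median of Y, not for its mean; the mean is always larger than the median when the standard deviation on the log scale is positive.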

It can be shown that

$$\mathrm{sd}(Y) = \mathrm{mean}(Y)\,\sqrt{\exp(\sigma^2) - 1}$$

Hence the standard deviation of $Y$ depends on the mean of $Y$. For this reason the standard deviation is rarely used as a measure of the spread of the distribution of the original data in this setting.

In this setting the coefficient of variation (cv) is often used as a measure of the spread of the data:

$$\mathrm{cv}(Y) = \frac{\mathrm{sd}(Y)}{\mathrm{mean}(Y)} = \sqrt{\exp(\sigma^2) - 1}$$

Properties of the logarithm and the exponential function

The basic properties of the logarithms and exponentials that we will use throughout the course:

$$\log(a \cdot b) = \log(a) + \log(b) \qquad \log(a / b) = \log(a) - \log(b)$$

$$\exp(a + b) = \exp(a) \cdot \exp(b) \qquad \exp(a - b) = \exp(a) / \exp(b)$$

$$\log(a^b) = b \cdot \log(a) \qquad \exp(a \cdot b) = \exp(a)^b = \exp(b)^a$$

Continuous data - two sample mean: Body temperature versus gender

Scientific question: Do the two genders have different normal body temperatures?
Design: 130 participants were randomly sampled, 65 males and 65 females.
Data: Measured temperature, gender.

Summary of the data (the units are degrees Celsius):

--------------------------------------------------------------
   Gender | N(tempC)   mean(tempC)   sd(tempC)   med(tempC)
----------+---------------------------------------------------
     Male |       65      36.72615    .3882158         36.7
   Female |       65      36.88923    .4127359         36.9
--------------------------------------------------------------

Body temperature: Plotting the data

[Figure: two plots of temperature (°C, axis from 35.5 to 38) by gender (Male, Female).]

The data look fine. A few outliers among females?

Body temperature: Checking the normality in each group

[Figure: histograms of temperature by gender, and Q-Q plots (inverse normal) for males and for females.]

Normality looks ok!

Body temperature: The model

A statistical model: two independent samples from normal distributions, i.e. the two samples are independent and each is assumed to be a random sample from a normal distribution:

1. The observations are independent (knowing one observation will not alter the distribution of the others).
2. The observations come from the same distribution, e.g. they all have the same mean and variance.
3. This distribution is a normal distribution with unknown mean $\mu_i$ and standard deviation $\sigma_i$: $N(\mu_i, \sigma_i^2)$.

Body temperature: Checking the assumptions

For the first two, think about how the data were collected!

1. Independence between groups: information on different individuals.
   Independence within groups: data are from different individuals, so the assumption is probably ok.
2. In each group: the observations come from the same distribution. Here we can only speculate. Does the body temperature depend on known factors of interest, for example heart rate, time of day, etc.?

Body temperature: The estimates

The estimates are found like we did on day 1:

$$\hat\mu_M = 36.73\ (36.63; 36.82), \quad \hat\sigma_M = 0.388, \quad \mathrm{sem}(\hat\mu_M) = 0.048$$

$$\hat\mu_F = 36.89\ (36.79; 36.99), \quad \hat\sigma_F = 0.413, \quad \mathrm{sem}(\hat\mu_F) = 0.051$$

Observe that the width of the prediction interval is approximately 2 × 1.96 × 0.4 °C = 1.6 °C, so there is a large variation in body temperature between individuals within each of the two groups.

We see that the average body temperature is higher among women.
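The standard errors and the width of the prediction interval quoted above follow from simple arithmetic. A minimal Python sketch, using the rounded standard deviations 0.388 and 0.413 and n = 65 from the slides; the helper name `sem` is only illustrative.

```python
import math

def sem(sd, n):
    """Standard error of a mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

sem_male = sem(0.388, 65)     # about 0.048
sem_female = sem(0.413, 65)   # about 0.051

# Approximate width of a 95% prediction interval for a single individual,
# using a common standard deviation of roughly 0.4 degrees Celsius.
pi_width = 2 * 1.96 * 0.4     # about 1.6 degrees Celsius

print(round(sem_male, 3), round(sem_female, 3), round(pi_width, 2))
```

The standard error of the mean shrinks with the sample size, while the width of the prediction interval does not: it describes the variation between individuals.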

Body temperature: Estimating the difference

Remember that the focus is on the difference between the two groups, meaning we are interested in

$$\delta = \mu_F - \mu_M$$

the unknown difference in mean body temperature. This is of course estimated by

$$\hat\delta = \hat\mu_F - \hat\mu_M = 36.89 - 36.73 = 0.16$$

What about the precision of this estimate? What is the standard error of a difference?

The standard error of a difference

If we have two independent estimates and, like here, calculate the difference, then the standard error of the difference is given as

$$\mathrm{se}(\hat\delta) = \mathrm{se}(\hat\mu_F - \hat\mu_M) = \sqrt{\mathrm{se}(\hat\mu_F)^2 + \mathrm{se}(\hat\mu_M)^2}$$

We note that the standard error of a difference between two independent estimates is larger than both of the two standard errors.

In the body temperature data we get

$$\mathrm{se}(\hat\delta) = \sqrt{0.048^2 + 0.051^2} = 0.070$$

and an approximate 95% CI

$$\hat\delta \pm 1.96 \cdot \mathrm{se}(\hat\delta) = 0.163 \pm 1.96 \times 0.070 = (0.025; 0.300)$$

Testing no difference in means

$$\hat\delta:\ 0.163\ (0.025; 0.300), \qquad \mathrm{se}(\hat\delta) = 0.070$$

Here we are especially interested in the hypothesis that body temperature is the same for the two genders:

Hypothesis: $\delta = \delta_0 = 0$

We can make an approximate test similar to day 1:

$$z_{obs} = \frac{\hat\delta - \delta_0}{\mathrm{se}(\hat\delta)} = \frac{0.163 - 0}{0.070} = 2.32$$

and find the p-value as

$$p = 2 \cdot \Pr(\text{standard normal} \geq |z_{obs}|)$$

We get p = 2.03%.

Exact inference for two independent normal samples

Just like in the one sample setting, it is possible to make exact inference based on the t-distribution. And again this is easily done by a computer.

Remember the model: two independent samples from normal distributions with means and standard deviations $\mu_M, \sigma_M$ and $\mu_F, \sigma_F$.

Note that both the means and the standard deviations might be different in the two populations. If one wants to make exact inference, then one has to make the additional assumption:

4. The standard deviations are the same: $\sigma_M = \sigma_F$
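The approximate inference described earlier on this page (standard error of a difference, approximate 95% CI and z-test) can be sketched in Python as follows. This is an illustration, not the course's Stata code; the function name `approx_two_sample` is hypothetical, and the inputs are the rounded values 0.163, 0.051 and 0.048 from the slides, so the last digits can differ slightly from results based on unrounded numbers.

```python
import math
from statistics import NormalDist

def approx_two_sample(diff, se1, se2, level=0.95):
    """Approximate (normal-based) CI and test for a difference of two
    independent estimates with standard errors se1 and se2."""
    std_normal = NormalDist()
    # Standard errors of independent estimates add on the squared scale.
    se_diff = math.hypot(se1, se2)
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)   # 1.96 for 95%
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    # Test of the hypothesis "true difference = 0".
    z_obs = diff / se_diff
    p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))
    return se_diff, ci, z_obs, p_value

se_diff, ci, z_obs, p_value = approx_two_sample(0.163, 0.051, 0.048)
print(round(se_diff, 3), [round(limit, 3) for limit in ci])
print(round(z_obs, 2), round(100 * p_value, 1), "%")
```

With these rounded inputs the standard error is about 0.070, the interval is roughly (0.03; 0.30) and the p-value is about 2%.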

Exact inference for two independent normal samples

Testing the hypothesis: $\sigma_M = \sigma_F$

This is done by considering the ratio between the two estimated standard deviations:

$$F_{obs} = \left[\frac{\text{largest observed standard deviation}}{\text{smallest observed standard deviation}}\right]^2$$

A large value of this F-ratio is critical for the hypothesis.

The p-value = the probability of observing an F-ratio at least as large as we have observed, given the hypothesis is true!

The p-value is here found by using an F-distribution with $(n_{largest} - 1)$ and $(n_{smallest} - 1)$ degrees of freedom:

$$p\text{-value} = 2 \cdot \Pr\big(F(n_{largest} - 1;\ n_{smallest} - 1) \geq F_{obs}\big)$$

Here we have

$$n_F = 65,\ \hat\sigma_F = 0.413, \qquad n_M = 65,\ \hat\sigma_M = 0.388, \qquad \text{so} \quad F_{obs} = \left[\frac{0.413}{0.388}\right]^2 = 1.13$$

The observed variance ($\mathrm{sd}^2$) is 13% higher among women. But could this be explained by sampling variation? What is the p-value?

To find the p-value we consult an F-distribution with 64 = (65-1) and 64 = (65-1) degrees of freedom.

We get p-value = 63%.

The difference in the observed standard deviations can be explained by sampling variation. We accept that $\sigma_M = \sigma_F$! The fourth assumption is ok!

We now have a common standard deviation: $\sigma = \sigma_F = \sigma_M$

This is estimated as a weighted average:

$$\hat\sigma = \sqrt{\frac{\hat\sigma_F^2\,(n_F - 1) + \hat\sigma_M^2\,(n_M - 1)}{(n_F - 1) + (n_M - 1)}} = \sqrt{\frac{0.413^2\,(65 - 1) + 0.388^2\,(65 - 1)}{(65 - 1) + (65 - 1)}} = 0.40$$

The pooled estimate $\hat\sigma$ is not found in the Stata output.

Based on this we can calculate a revised/updated standard error of the difference:

$$\mathrm{se}(\hat\delta) = \hat\sigma\,\sqrt{\frac{1}{n_F} + \frac{1}{n_M}} = 0.40 \times \sqrt{\frac{1}{65} + \frac{1}{65}} = 0.070$$

$$\hat\delta:\ 0.163, \qquad \mathrm{se}(\hat\delta) = 0.070$$

Exact confidence intervals and p-values are found by using a t-distribution with $n_M + n_F - 2 = 65 + 65 - 2 = 128$ d.f.

$$\hat\delta \pm t^{128}_{0.975} \cdot \mathrm{se}(\hat\delta) = 0.163 \pm 1.98 \times 0.070 = (0.024; 0.302)$$

And the exact test:

$$H: \delta = 0 \qquad t_{obs} = \frac{\hat\delta - 0}{\mathrm{se}(\hat\delta)} = \frac{0.163}{0.070} = 2.32$$

and find the p-value as

$$p = 2 \cdot \Pr(t\text{-distribution with 128 d.f.} \geq |t_{obs}|)$$

We get p = 2.2% (either from a table of the t-distribution, or from Stata).
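The quantities behind the exact analysis (variance ratio, pooled standard deviation, standard error and t statistic) can be reproduced with a few lines of Python. This is a sketch under the equal-variance assumption, using the rounded values from the slides; the function name `pooled_analysis` is hypothetical, and the p-values are left to Stata because the Python standard library has no F- or t-distribution.

```python
import math

def pooled_analysis(sd1, n1, sd2, n2, diff):
    """Variance ratio, pooled sd, se of the difference and t statistic
    for two independent samples, assuming a common standard deviation."""
    # F-ratio: largest observed sd over smallest observed sd, squared.
    f_obs = (max(sd1, sd2) / min(sd1, sd2)) ** 2
    df = (n1 - 1) + (n2 - 1)
    # Pooled sd: square root of the weighted average of the two variances.
    pooled_sd = math.sqrt((sd1 ** 2 * (n1 - 1) + sd2 ** 2 * (n2 - 1)) / df)
    se_diff = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    t_obs = diff / se_diff
    return f_obs, pooled_sd, se_diff, t_obs, df

f_obs, pooled_sd, se_diff, t_obs, df = pooled_analysis(0.413, 65, 0.388, 65, 0.163)
print(round(f_obs, 2), round(pooled_sd, 2), round(se_diff, 3), round(t_obs, 2), df)
```

With equal group sizes the pooled standard error coincides with the standard error from the approximate analysis, so the only change in the exact analysis is the use of the t-distribution with 128 degrees of freedom.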

Stata: two-sample normal analysis

The F-test and t-test are easily done in Stata (more details can be found in the file day.do).

    . cd "D:\Teaching\BasalBiostat\Lectures\Day"
    D:\Teaching\BasalBiostat\Lectures\Day
    . use normtemp.dta, clear
    . * Checking the normality.
    . qnorm tempc if sex==, title("male") name(plot, replace)
    . qnorm tempc if sex==, title("female") name(plot3, replace)
    . graph combine plot plot3, name(plotright, replace) col()

    . sdtest tempc, by(sex)

    Variance ratio test
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        Male |      65    36.72615    .0481522    .3882158    36.62996    36.82235
      Female |      65    36.88923    .0511936    .4127359    36.78696     36.9915
    ---------+--------------------------------------------------------------------
    combined |     130    36.80769    .0357326    .4074148    36.73699    36.87839
    ------------------------------------------------------------------------------
        ratio = sd(male) / sd(female)                                 f =   0.8847
    Ho: ratio = 1                                    degrees of freedom =   64, 64

        Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
      Pr(F < f) = 0.3128         2*Pr(F < f) = 0.6256           Pr(F > f) = 0.6872

    . ttest tempc, by(sex)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
        Male |      65    36.72615    .0481522    .3882158    36.62996    36.82235
      Female |      65    36.88923    .0511936    .4127359    36.78696     36.9915
    ---------+--------------------------------------------------------------------
    combined |     130    36.80769    .0357326    .4074148    36.73699    36.87839
    ---------+--------------------------------------------------------------------
        diff |           -.1630766     .070281               -.3021396   -.0240136
    ------------------------------------------------------------------------------
        diff = mean(male) - mean(female)                              t =  -2.3204
    Ho: diff = 0                                     degrees of freedom =      128

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0110         Pr(|T| > |t|) = 0.0219          Pr(T > t) = 0.9890

Exact inference for two independent normal samples

What if you reject the hypothesis of the same sd in the two groups?

1. This indicates that the variation in the two groups differs! Think about why!
2. Often it is due to the fact that the assumption of normality is not satisfied. Maybe you would do better by making the statistical analysis on another scale, e.g. log.
3. If you still want to compare the means on the original scale you can make approximate inference based on the t-distribution (e.g. ttest tempc, by(sex) unequal).
4. If you only want to test the hypothesis that the two distributions are located at the same place, then you can use the non-parametric Wilcoxon-Mann-Whitney test (see later).
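Point 3 above refers to the unequal-variance t-test. As far as I know, Stata's `unequal` option uses Satterthwaite's approximation to the degrees of freedom; the Python sketch below shows that approximation on the rounded body temperature values. The function name `unequal_variance_t` is hypothetical, and the example is only meant to show the mechanics, since for these data the equal-variance analysis was accepted.

```python
import math

def unequal_variance_t(sd1, n1, sd2, n2, diff):
    """t statistic without assuming equal standard deviations, with
    Satterthwaite's approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1   # squared standard error of the first mean
    v2 = sd2 ** 2 / n2   # squared standard error of the second mean
    se_diff = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se_diff, diff / se_diff, df

se_w, t_w, df_w = unequal_variance_t(0.413, 65, 0.388, 65, 0.163)
print(round(se_w, 4), round(t_w, 2), round(df_w, 1))
```

Here the approximate degrees of freedom come out just below 128, so the two analyses agree closely; the difference becomes important when the standard deviations or the group sizes are very different.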

Body temperature example - formulations

Methods: Data were analyzed as two independent samples from normal distributions based on Student's t-distribution. The assumption of normality was checked by a Q-Q plot. Estimates are given with 95% confidence intervals.

Results: The mean body temperature was 36.9 (36.8; 37.0) °C among women compared to 36.7 (36.6; 36.8) °C among men. The mean was 0.16 (0.02; 0.30) °C higher for females and this was statistically significant (p = 2.03%).

Conclusion: Based on this study we conclude that women have a small, but statistically significantly higher mean body temperature than men.

Example 7. Birth weight and heavy smoking

Scientific question: Do the smoking habits of the mother influence the birth weight of the child?

Design and data: (observational) The birth weights (kg) of children born to 14 heavy smokers and 15 non-smokers were recorded.

Summary of the data (the unit is kg):

------------------------------------------------------------------------
   Group |  Obs      Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------
Non-smok |   15     3.627       .0925       .3584       3.428      3.825
Heavy sm |   14     3.174       .1238       .4631       2.907      3.442
------------------------------------------------------------------------

Already here we observe that the average birth weight is smallest among heavy smokers: difference = 452 g.

Example 7. Birth weight and heavy smoking

Plot the data!

[Figure: plots of birth weight by smoking habits (Non-smoker, Heavy smoker).]

[Figure: histograms of birth weight by smoking habits, and Q-Q plots (inverse normal) for non-smokers and for heavy smokers.]

Independence, same distribution and normality seem ok.

Example 7. Birth weight and heavy smoking - exact inference

Compare the standard deviations (using the computer):

$$F_{obs} = \left[\frac{0.463}{0.3584}\right]^2 = 1.67, \qquad p = 35\% \text{ from } F(13, 14)$$

We accept that the two standard deviations are identical, and again by computer we get:

Difference in mean birth weight: 0.452 (0.138; 0.767) kg
Hypothesis: no difference in mean birth weight, p = 0.6%

Conclusion of the test: If there was no difference between the two groups, then it would be almost impossible to observe such a large difference as we have seen. Hence the hypothesis cannot be true!

The birth weight example - formulations

Methods: like the body temperature example ("Data ... intervals.").

Results: The mean birth weight was 3.627 (3.428; 3.825) kg among non-smokers compared to 3.174 (2.907; 3.442) kg among heavy smokers. The difference 452 (138; 767) g was statistically significant (p = 0.6%).

Conclusion: Children born to heavy smokers have a birth weight that is statistically significantly smaller than that of children born to non-smokers.

The study has only limited information on the precise size of the association. Furthermore we have not studied the implications of the difference in birth weight or whether the difference could be explained by other factors, like eating habits.

Non-Parametric test: Wilcoxon-Mann-Whitney test

Until now we have only made statistical inference based on a parametric model. E.g. we have focused on estimating the difference between two groups and supplying the estimate with a confidence interval.

We have also performed a statistical test of no difference based on the estimate and the standard error: a parametric test.

There are other types of tests, non-parametric tests, that are not based on a parametric model. These tests are also based on models, but they are not parametric models.

We will here look at the Wilcoxon-Mann-Whitney test, which is the non-parametric analogue of the two sample t-test.

Non-Parametric test: Wilcoxon-Mann-Whitney test

The key feature of rank-based non-parametric tests such as this one is that they are based on the ranks of the data and not the actual values.

       Heavy smokers              Non-smokers
    Birth weight   Rank       Birth weight   Rank
        2.34         1            2.71         3
        2.38         2            3.31        10
        2.74         4            3.36        11
        2.86         5            3.41        12
        2.90         6            3.51        14
        3.18         7            3.54        16
        3.23         8            3.60        17.5
        3.27         9            3.61        19
        3.42        13            3.70        23
        3.53        15            3.73        24
        3.60        17.5          3.83        25
        3.65        20.5          3.89        26
        3.65        20.5          3.99        27
        3.69        22            4.08        28
                                  4.13        29

Rank 1 is given to the smallest observation (2.34). The two observations of 3.60 are number 17 and 18 in the ordering and share the rank 17.5.
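The ranking in the table above, including the shared ranks for tied values, can be reproduced with a short Python sketch. The birth weights are the 29 values read from the table; the helper name `midranks` is only illustrative.

```python
# Birth weights (kg) as listed in the rank table.
heavy = [2.34, 2.38, 2.74, 2.86, 2.90, 3.18, 3.23, 3.27,
         3.42, 3.53, 3.60, 3.65, 3.65, 3.69]
non = [2.71, 3.31, 3.36, 3.41, 3.51, 3.54, 3.60, 3.61,
       3.70, 3.73, 3.83, 3.89, 3.99, 4.08, 4.13]

def midranks(values):
    """Map each distinct value to its rank in the pooled sample;
    tied values share the average of the positions they occupy."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Positions i+1, ..., j (1-based) all hold the same value.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks

# Rank all 29 observations together, then add the ranks within each group.
rank_of = midranks(heavy + non)
rank_sum_heavy = sum(rank_of[v] for v in heavy)
rank_sum_non = sum(rank_of[v] for v in non)

print(rank_sum_heavy, rank_sum_non)
```

The two rank sums always add up to 29 × 30 / 2 = 435, the sum of the ranks 1 to 29, which is a useful check on a hand calculation.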

Non-Parametric test: Wilcoxon-Mann-Whitney test

We can now add the ranks in one of the groups, here the heavy smokers:

Heavy smokers: observed rank sum = 150.5

Hypothesis: The distribution of birth weight is the same among heavy smokers and non-smokers.

Assuming the hypothesis is true, one can calculate the expected rank sum among the heavy smokers and the standard error of the observed rank sum, and calculate a test statistic:

$$z_{obs} = \frac{\text{Observed rank sum} - \text{Expected rank sum}}{\mathrm{se}(\text{Observed rank sum})} = \frac{150.5 - 210}{22.9} = -2.597$$

P-value = 0.9%

The p-value is found as $2 \cdot \Pr(\text{standard normal} \geq |z_{obs}|)$.

Non-Parametric test: Wilcoxon-Mann-Whitney test

We saw that the rank sum among heavy smokers was smaller than expected if there was no true difference between the two groups. So small that we would only observe such a discrepancy in about one out of 100 (p-value = 0.9%) studies like this.

We reject the hypothesis!

Conclusion: Children born to heavy smokers have a statistically significantly lower birth weight than children born to non-smokers.

Remember this depends on the sample size, the design, the statistical analysis...

Non-Parametric test: Wilcoxon-Mann-Whitney test

Some comments:

There are two assumptions behind the test:
1. Independence between and within the groups.
2. Within each group: The observations come from the same distribution, e.g. they all have the same mean and variance.

The test is designed to detect a shift in location between the two populations and not, for example, a difference in the variation in the two populations.

You will only get a p-value: the possible difference in location is not quantified by an estimate with a confidence interval.

As a test it is just as valid as the t-test!

Stata: Wilcoxon-Mann-Whitney test

    . use bwsmoking.dta, clear
    (Birth weight (kg) of 29 babies born to 14 heavy smokers and 15 non-smokers)
    . ranksum bw, by(group)

    Two-sample Wilcoxon rank-sum (Mann-Whitney) test

           group |      obs    rank sum    expected
    -------------+---------------------------------
      Non-smoker |       15       284.5         225
    Heavy smoker |       14       150.5         210
    -------------+---------------------------------
        combined |       29         435         435

    unadjusted variance      525.00
    adjustment for ties       -0.26
                         ----------
    adjusted variance        524.74

    Ho: bw(group==Non-smoker) = bw(group==Heavy smoker)
                 z =   2.597
        Prob > |z| =   0.0094
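The numbers in the rank-sum output can be traced with a small Python sketch using the usual normal approximation with a correction for ties. The function name `rank_sum_z` is hypothetical; the inputs are the heavy smokers' rank sum 150.5, the group sizes 14 and 15, and the two pairs of tied observations in the data.

```python
import math
from statistics import NormalDist

def rank_sum_z(rank_sum, n1, n2, tie_sizes=()):
    """Normal approximation for the Wilcoxon rank-sum test.

    rank_sum is the observed rank sum in the group of size n1;
    tie_sizes lists the size of each group of tied observations."""
    n = n1 + n2
    expected = n1 * (n + 1) / 2
    variance = n1 * n2 * (n + 1) / 12            # unadjusted variance
    # Ties make the ranks less variable; subtract the standard correction.
    variance -= n1 * n2 * sum(t ** 3 - t for t in tie_sizes) / (12 * n * (n - 1))
    z = (rank_sum - expected) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return expected, variance, z, p_value

# Heavy smokers: rank sum 150.5, n = 14; non-smokers: n = 15.
# Two pairs of tied values (3.60 and 3.65) give tie_sizes = (2, 2).
expected, variance, z, p_value = rank_sum_z(150.5, 14, 15, tie_sizes=(2, 2))
print(expected, round(variance, 2), round(z, 3), round(p_value, 4))
```

The sign of z depends only on which group the rank sum is taken from: Stata reports z = 2.597 because it uses the first group (the non-smokers), while the slide uses the heavy smokers and gets -2.597. The two-sided p-value is the same.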

Type 1 and type 2 errors

We will here return to the simple interpretation of a statistical test. We test a hypothesis: $\delta = \delta_0$

We will make a
    Type 1 error if we reject the hypothesis when it is true.
    Type 2 error if we accept the hypothesis when it is false.

If we use a specific significance level, $\alpha$ (typically 5%), then we know:

$$\Pr(\text{reject } \delta = \delta_0 \text{ given it is true}) = \Pr(\text{reject } \delta = \delta_0 \mid \delta = \delta_0) = \alpha$$

The risk of a Type 1 error = $\alpha$

Type 1 and type 2 errors

What about the risk of a Type 2 error?

$$\Pr(\text{accept } \delta = \delta_0 \text{ given it is not true}) = \Pr(\text{accept } \delta = \delta_0 \mid \delta \neq \delta_0) = \beta = \ ?$$

This will depend on several things:
1. The statistical model and test we will be using.
2. What is the true value of $\delta$?
3. The precision of the estimate. What is the sample size and standard deviation?

That is, the risk of a Type 2 error, $\beta$, is not constant.

Often we consider the statistical power:

$$\Pr(\text{reject } \delta = \delta_0 \mid \delta \neq \delta_0) = 1 - \beta$$

Statistical power: planning a study - testing for no difference

Suppose we are planning a new study of fish oil and its possible effect on diastolic blood pressure (DBP).

Assume we want to make a randomized trial with two groups of equal size and we will test the hypothesis of no difference.

We believe that the true difference between groups in DBP is 5 mmHg. Furthermore we believe that the standard deviation in the increase in DBP is 9 mmHg.

We plan to include 40 women in each group and analyze using a t-test.

What is the chance that this study will lead to a statistically significant difference between the two groups, given the true difference is 5 mmHg?

[Figure: Statistical power when the true difference is 5 and sd = 7, 8, 9 or 10, and we test the hypothesis of no difference. Power in % against the number of observations in each group; for sd = 9 and n = 40 the power is 69%.]
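The power read from the plot can be approximated by hand. The sketch below uses a normal approximation, which is an assumption made here for simplicity: the slide's 69% comes from the t-test, so the approximation is expected to be slightly higher. The function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test of no difference between two
    groups of equal size (normal approximation to the t-test)."""
    std_normal = NormalDist()
    # Standard error of the difference between the two group means.
    se = sd * math.sqrt(2 / n_per_group)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    # Probability of ending in either rejection region when the true
    # difference is delta; the second term is almost always negligible.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

power_40 = approx_power(delta=5, sd=9, n_per_group=40)
print(round(100 * power_40), "%")
```

The approximation gives about 70%, in line with the 69% from the plot. Changing `sd` to 7, 8 or 10 traces the other curves in the figure.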

Statistical power: planning a study

We plan to include 40 women in each group and analyze using a t-test, and the true difference is 5 mmHg and sd = 9 mmHg.

Power = 69%

That is, there is only a 69% chance that such a study will lead to a statistically significant result, given the assumptions are true.

How many women should we include in each group if we want to have a power of 90%?

Based on the plot we see that approx. 69 or more women in each group will lead to a power of 90%.

[Figure: Statistical power when the true difference is 5 and sd = 7, 8, 9 or 10, and we test the hypothesis of no difference. Power in % against the number of observations in each group; for sd = 9 a power of 90% is reached at n = 69.]

The power increases as a function of the expected difference between the groups and decreases as a function of the variation (standard deviation) within the groups.

[Figure: power in % against the number of observations in each group, test for no difference, curves for sd = 7, 8, 9 and 10.]

Power - two unpaired normal samples

In general we have five quantities in play:

    $\delta = \mu_1 - \mu_2$    The true difference between groups
    $\sigma$                    The standard deviation within each group
    $\alpha$                    The significance level (typically 5%)
    $\beta$                     The risk of a type 2 error = 1 - the power
    $n$                         The sample size in each group

If we know four of these, then we can determine the last.

Typically, we know the first four and want to know the sample size, or we know $\delta$, $\sigma$, $\alpha$ and $n$ and then we want to know the power.
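Solving for the sample size when the other four quantities are known can be sketched with the standard normal-approximation formula. This is not the calculation Stata performs (Stata uses the t-distribution and iterates), so a t-based result is expected to be slightly larger; the function name `approx_n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Approximate sample size in each of two equally sized groups for a
    two-sided test of no difference (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 5%
    z_power = std_normal.inv_cdf(power)           # 1.28 for power = 90%
    return 2 * (sd * (z_alpha + z_power) / delta) ** 2

n_raw = approx_n_per_group(delta=5, sd=9)
n_needed = math.ceil(n_raw)   # round up to a whole number of participants
print(round(n_raw, 1), n_needed)
```

The approximation gives 69 in each group, which agrees with the reading from the power plot. The formula also shows the trade-offs directly: halving the true difference multiplies the required sample size by four.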

Stata: Power for two unpaired normal samples

Power calculations are done using the power command:

    . power twomeans 0 5, sd1(9) sd2(9) alpha(0.05) power(0.9)

    Performing iteration ...

    Estimated sample sizes for a two-sample means test
    Satterthwaite's t test assuming unequal variances
    Ho: m2 = m1  versus  Ha: m2 != m1

    Study parameters:

            alpha =    0.0500
            power =    0.9000
            delta =    3.2867
               m1 =    0.0000
               m2 =    5.0000
              sd1 =    9.0000
              sd2 =    9.0000

    Estimated sample sizes:

                N =       140
      N per group =        70

    * Prior to Stata 13:
    * sampsi 0 5, sd1(9) sd2(9) alpha(0.05) power(0.9)

Comments on sample size calculations

Most often done by computer (in Stata: power).

There are many different formulas, see Kirkwood & Stern Table 35.1. We will only look at a few in this course.

It is in general more relevant to test that the difference is larger than a specified value (a so-called superiority or non-inferiority study), or to plan the study so that it is expected to yield a confidence interval with a certain width.

You need to know the true difference and you must have an idea of the variation within the groups. The latter you might find based on hospital records or in the literature.

Sample size calculations after the study has been carried out (post hoc) are nonsense! The confidence interval will show how much information you have in the study.