Lecture 18: Review Lecture. Types of Biostatistics. Approach to Modeling.

Transcription

Lecture 18: Review Lecture
Ani Manichaikul
15 May

Types of Biostatistics
1) Descriptive Statistics
- Exploratory Data Analysis; often not in the literature
- Summaries: "Table 1" in a paper
- Goal: visualize relationships, generate hypotheses
2) Inferential Statistics
- Confirmatory Data Analysis
- "Methods" section of a paper
- Goal: quantify relationships, test hypotheses

Approach to Modeling
A general approach for most statistical modeling is to:
- Define the Population of Interest
- State the Scientific Questions & Underlying Theories
- Describe and Explore the Observed Data
- Define the Model
  - Probability part (models the randomness / noise)
  - Systematic part (models the expectation / signal)

Approach to Modeling (continued)
- Estimate the Parameters in the Model
- Fit the Model to the Observed Data
- Make Inferences about Covariates
- Check the Validity of the Model
- Verify the Model Assumptions
- Re-define, Re-fit, and Re-check the Model if necessary
- Interpret the results of the Analysis in terms of the Scientific Questions of Interest

Grouping: Frequency Distribution Tables
- Shows the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays
[Table: Age Interval, Frequency]

Stem-and-Leaf Plots
Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Histograms
- Pictures of the frequency or relative frequency distribution
[Table: Age Interval, Observations, Frequency]
[Figure: Histogram of Age, by Age Category]
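The grouping idea can be sketched in a few lines of Python using the ten ages listed on the stem-and-leaf slide. The 10-year interval width and the variable names are assumptions made for illustration; the slide's own frequency table is not preserved.

```python
from collections import defaultdict

# Ages from the stem-and-leaf slide (10 observations).
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

# Stem = tens digit, leaf = ones digit.
stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10].append(age % 10)

# Frequency table using 10-year intervals (an assumed interval width).
frequency = {f"{10 * s}-{10 * s + 9}": len(leaves) for s, leaves in stems.items()}

for stem, leaves in stems.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
print(frequency)
```

The same counts are what a histogram with these intervals would display as bar heights.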

Box-and-Whisker Plots
[Figure: Box Plot of Age; axis: Age in Years]
- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15*1.5 = 66.5
- Lower Fence = 29 - 15*1.5 = 6.5

Continuous Variables: Scatterplot
[Figure: Age by Height in cm; axes: Age in Years, Height in Centimeters]
- Scatterplots visually display the relationship between two continuous variables

Why is the power of a test important?
- Power indicates the chance of finding a significant difference when there really is one
- Low power: likely to obtain non-significant results even when significant differences exist
- High power is desirable!
- Low power is usually caused by small sample size
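The fence arithmetic for the box plot can be checked with a short sketch. It assumes the box plot uses the same ten ages as the stem-and-leaf slide and that quartiles are taken as medians of the lower and upper halves of the sorted data; other quartile conventions can give slightly different values.

```python
from statistics import median

# Assumed data: the ten ages from the stem-and-leaf slide.
ages = sorted([25, 26, 29, 32, 35, 36, 38, 44, 49, 51])

# Quartiles as medians of the lower and upper halves (one common convention).
half = len(ages) // 2
q1 = median(ages[:half])
q3 = median(ages[-half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences would be drawn as individual points.
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With these ages no observation falls outside the fences, so the whiskers would extend to the minimum and maximum.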

We're not always right

Errors in Hypothesis Testing: $\alpha$
- Aim: to keep Type I error small by specifying a small rejection region
- $\alpha$ is set before performing a test, usually at 0.05

Errors in Hypothesis Testing: $\beta$
- Aim: to keep Type II error small and thus power high

$\beta$: Probability of Type II Error
- The value of $\beta$ is usually unknown since it depends on a specified alternative value.
- $\beta$ depends on sample size and $\alpha$.
- Before data collection, scientists decide:
  - the test they will perform
  - $\alpha$
  - the desired $\beta$
- They will use this information to choose the sample size
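As an illustration of how $\alpha$ and the desired $\beta$ feed into the choice of sample size, here is a minimal sketch for comparing two means. The normal-approximation formula, the function name `n_per_group`, and the example effect size are assumptions for illustration, not taken from the slides.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample z-test.

    delta: difference in means under the alternative (assumed known)
    sigma: common standard deviation (assumed known)
    power: 1 - beta, the desired probability of rejecting H0 when delta is real
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for Type I error
    z_beta = z.inv_cdf(power)            # quantile tied to Type II error
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical example: detect a half-standard-deviation difference.
print(n_per_group(delta=0.5, sigma=1.0))              # 80% power
print(n_per_group(delta=0.5, sigma=1.0, power=0.90))  # 90% power needs more
```

Raising the desired power, or shrinking the difference to be detected, increases the required sample size, which matches the point that low power usually comes from small samples.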

P-Values
Definition: The p-value for a hypothesis test is the probability of obtaining, by chance alone when $H_0$ is true, a value of the test statistic as extreme or more extreme (in the appropriate direction) than the one actually observed.

Steps of Hypothesis Testing
- Define the null hypothesis, $H_0$.
- Define the alternative hypothesis, $H_a$, where $H_a$ is usually of the form "not $H_0$".
- Define the type 1 error, $\alpha$, usually 0.05.
- Calculate the test statistic
- Calculate the P-value
- If the P-value is less than $\alpha$, reject $H_0$. Otherwise fail to reject $H_0$.

Why use linear regression?
Linear regression is very powerful. It can be used for many things:
- Binary X
- Continuous X
- Categorical X
- Adjustment for confounding
- Interaction
- Curved relationships between X and Y

SLR: $Y = \beta_0 + \beta_1 X_1 + \varepsilon$
- Linear regression is used for continuous outcome variables
- $\beta_0$: mean outcome when X=0 (Center!)
- Binary X = dummy variable for group
  - $\beta_1$: mean difference in outcome between groups
- Continuous X
  - $\beta_1$: mean difference in outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to $\beta_0$
- Test $\beta_1 = 0$ in the population
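The interpretation of the SLR coefficients can be illustrated with a small least-squares sketch. The data values and the helper `fit_slr` are made up for illustration; the point is that with a binary X the slope equals the difference in group means, and that centering a continuous X makes the intercept the mean outcome.

```python
def fit_slr(x, y):
    """Ordinary least squares for Y = b0 + b1*X; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Binary X (hypothetical data): b0 is the mean of group 0,
# b1 is the difference in means between the groups.
group = [0, 0, 0, 1, 1, 1]
outcome = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
b0, b1 = fit_slr(group, outcome)

# Continuous X (hypothetical data): centering X leaves the slope unchanged
# and turns the intercept into the mean outcome at the average X.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
x_centered = [xi - sum(x) / len(x) for xi in x]
b0_raw, b1_raw = fit_slr(x, y)
b0_centered, b1_centered = fit_slr(x_centered, y)

print(b0, b1, b0_raw, b0_centered, b1_centered)
```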

Assumptions of Linear Regression (in Simple Linear Regression)
- L: Linear relationship
- I: Independent observations
- N: Normally distributed around line
- E: Equal variance across X's

Regression Methods
In simple linear regression (SLR):
- One Predictor / Covariate / Explanatory Variable: X
In multiple linear regression (MLR):
- Same Assumptions as SLR (i.e. L.I.N.E.), but:
- More than one Covariate: $X_1, X_2, X_3, \dots, X_p$
- Model: $Y \sim N(\mu, \sigma^2)$, with $\mu = E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$

Nested models
One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.

The F test
- $H_0$: all new $\beta$s = 0 in population
- $H_A$: at least one new $\beta$ is not 0 in population
- $F_{obs} = \dfrac{(RSS_{parent} - RSS_{nested}) / (\text{\# of new variables added})}{RSS_{nested} / (\text{residual df}_{nested})}$, where the subscript "nested" refers to the extended model
- Example: $F_{obs} = \dfrac{(69.6 - 49.8)/2}{49.8 / (\text{residual df}_{nested})}$. What is $F_{cr}$?

Difference in assessing variables: nested models
- other predictor(s)
  - assess with t test if a single variable defines the predictor
  - assess with F test (today) if two or more variables are needed to define the predictor
- potential confounder(s)
  - compare CI of primary predictor to see whether new parameter is significantly different

The F test: notes
- The F test can be used to compare any two nested models
- If only one variable is added, it's easier to compare the models using the t test for that variable
- $t^2 = F$ if one variable is added
- For any regression, the estimated variance of the residuals is RSS/(residual df)
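The F statistic for comparing nested linear models can be sketched as follows. The residual sums of squares 69.6 and 49.8 and the two added variables come from the slide; the residual degrees of freedom of the extended model is not preserved in the transcript, so the value 100 below is purely hypothetical.

```python
def f_statistic(rss_parent, rss_extended, n_new, resid_df_extended):
    """F statistic for adding n_new variables to a parent linear model."""
    numerator = (rss_parent - rss_extended) / n_new      # drop in RSS per new variable
    denominator = rss_extended / resid_df_extended       # residual variance, extended model
    return numerator / denominator


# RSS values and n_new=2 are from the slide; resid_df_extended=100 is hypothetical.
f_obs = f_statistic(69.6, 49.8, n_new=2, resid_df_extended=100)
print(round(f_obs, 2))
```

The observed value would then be compared with the critical value of an F distribution with (number of new variables, residual df of the extended model) degrees of freedom.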

Nested Models
- Comparing nested models
  - 1 new variable: use t test for that variable
  - 2+ new variables: use F test
- Categorical predictor
  - set one group as reference
  - create dummy variables for the other groups
  - include/exclude all dummy variables
  - evaluate categorical predictor with F test

Splines and Quadratic Terms
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically or by hypothesis
  - the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e. the change in slope)
- A quadratic term allows for curvature in the model

Effect Modification
In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor. If the 3rd predictor is binary, that results in a graph in which the two lines (for the two groups) are no longer parallel.

Logistic regression
- For binary outcomes
- Model the log odds of the probability, which we also call the logit
- Baseline term interpreted as log odds
- Other coefficients are log odds ratios

Logistic regression model
$\log[\text{odds}(\text{relief} \mid Tx)] = \log\left(\dfrac{P(\text{relief} \mid Tx)}{P(\text{no relief} \mid Tx)}\right) = \beta_0 + \beta_1 Tx$
where: Tx = 0 if Placebo, 1 if Drug

Then
- $\log(\text{odds}(\text{relief} \mid \text{Drug})) = \beta_0 + \beta_1$
- $\log(\text{odds}(\text{relief} \mid \text{Placebo})) = \beta_0$
- $\log(\text{odds}(R \mid D)) - \log(\text{odds}(R \mid P)) = \beta_1$

And thus: $\log\left(\dfrac{\text{odds}(R \mid D)}{\text{odds}(R \mid P)}\right) = \beta_1$
And: $OR = \exp(\beta_1) = e^{\beta_1}$
So: $\exp(\beta_1)$ = odds ratio of relief for patients taking the Drug vs. patients taking the Placebo.

Logistic Regression
Logit estimates: Number of obs = 70, LR chi2(1) = 2.83
[Table columns: y, Coef., Std. Err., z, P>|z|, 95% Conf. Interval; rows: drug, _cons]
Estimates: $\log(\text{odds}(\text{relief})) = \hat\beta_0 + \hat\beta_1 \text{Drug}$, with $\hat\beta_1 = 0.814$
Therefore: OR = exp(0.814) = 2.26
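The step from the drug coefficient to the odds ratio can be checked numerically. The coefficient 0.814 is from the slide; the intercept is not preserved in the transcript, so the value used below is hypothetical and serves only to show that it cancels out of the odds ratio.

```python
import math

beta1_hat = 0.814   # drug coefficient, from the slide
beta0_hat = -0.5    # hypothetical intercept; the slide's value is not preserved

log_odds_placebo = beta0_hat              # Tx = 0
log_odds_drug = beta0_hat + beta1_hat     # Tx = 1

# Odds ratio of relief, Drug vs. Placebo: the intercept cancels.
odds_ratio = math.exp(log_odds_drug) / math.exp(log_odds_placebo)
print(round(odds_ratio, 2))
```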

Adding other variables
What if Pr(relief) = function of Drug or Placebo AND Age?
We could easily include age in a model such as:
$\log(\text{odds}(\text{relief})) = \beta_0 + \beta_1 \text{Drug} + \beta_2 \text{Age}$

Types of interpretation
- $\beta_0 + \beta_1$ = ln(odds) (for X=1)
- $\beta_1$ = difference in log odds
- $e^{\beta_0 + \beta_1}$ = odds (for X=1)
- $e^{\beta_1}$ = odds ratio
But we started with P(Y=1). Can we find that?

Logistic Regression
As in MLR, we can include many additional covariates. For a Logistic Regression model with p predictors:
$\log(\text{odds}(Y=1)) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
where: $\text{odds}(Y=1) = \dfrac{\Pr(Y=1)}{1 - \Pr(Y=1)} = \dfrac{\Pr(Y=1)}{\Pr(Y=0)}$

More useful math
- $\text{odds} = \dfrac{\text{probability}}{1 - \text{probability}}$
- $\text{probability} = \dfrac{\text{odds}}{1 + \text{odds}}$
- so probability (for X=1) $= \dfrac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$
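The "more useful math" conversions translate directly into code. This is a minimal sketch; the function names and the example coefficients are chosen here for illustration.

```python
import math


def odds_from_prob(p):
    """odds = probability / (1 - probability)"""
    return p / (1 - p)


def prob_from_odds(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)


def prob_from_log_odds(log_odds):
    """Invert the logit: probability = e^x / (1 + e^x)."""
    return prob_from_odds(math.exp(log_odds))


# Hypothetical coefficients: probability for X = 1 is e^(b0+b1) / (1 + e^(b0+b1)).
b0, b1 = -1.0, 0.8
print(prob_from_log_odds(b0 + b1))
```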

Nested models
Adding a single new variable to the model
- null model: $\ln\left(\dfrac{p}{1-p}\right) = \beta_0 + \beta_1 (\text{Age} - 30)$
- full model: $\ln\left(\dfrac{p}{1-p}\right) = \beta_0 + \beta_1 (\text{Age} - 30) + \beta_2 (\text{Multivitamin})$

Comparing nested models that differ by one variable
- Compare models with p-value or CI
- What method is this? The Wald test, a test that applies the CLT
  - like the Z test comparing proportions in a 2x2 table
  - analogous to the t test for linear regression
- $H_0$: the new variable is not needed, or $H_0$: $\beta_{new} = 0$ in the population

Conclusion from the Wald test
- The p-value for multivitamin is less than 0.05, and the CI for the multivitamin coefficient does not include 0 (the CI for the OR doesn't include 1)
- Reject $H_0$
- Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population

Interpretation: log odds
- $\beta_0$: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- $\beta_1$: the log odds ratio of not visiting a physician for a one year increase in age, controlling for multivitamin use
- $\beta_2$: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age

Interpretation: odds and odds ratio
- $\exp\{\beta_0\}$: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
- $\exp\{\beta_1\}$: after adjusting for multivitamin use, the odds of not visiting a physician change by a factor of $\exp\{\beta_1\} = 1.001$ for each additional year of age
  - additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p > 0.05)
- $\exp\{\beta_2\}$: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is $\exp\{\beta_2\} = 0.46$, adjusting for age
  - taking multivitamins is associated with regular physician visits (p = 0.007)

Interpretation In General
Also: $\log\left(\dfrac{\text{odds}(Y=1 \mid X_1 + 1, X_2)}{\text{odds}(Y=1 \mid X_1, X_2)}\right) = \beta_1$
And: $OR = \exp(\beta_1)$
$\exp(\beta_1)$ is the multiplicative change in odds for a 1 unit increase in $X_1$, provided $X_2$ is held constant. The result is similar for $X_2$.

CHD by smoking and coffee
- $Y_i$ = 1 if CHD case, 0 if control
- $COF_i$ = 1 if Coffee Drinker, 0 if not
- $SMK_i$ = 1 if Smoker, 0 if not
- $p_i = \Pr(Y_i = 1)$
- $n_i$ = Number observed at pattern i of Xs

Logistic Regression Model
$\log\left(\dfrac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 COF_i + \beta_2 SMK_i + \beta_3 COF_i \cdot SMK_i$
Which implies that $\Pr(Y_i = 1)$ is the logistic function
$p_i = \dfrac{e^{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2}}}{1 + e^{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2}}}$

Logistic Regression Model
- $Y_i$ are from a Binomial($n_i$, $p_i$) distribution
- $Y_i$ are independent
- log odds($Y_i = 1$) (or, logit($Y_i = 1$)) is a function of
  - Coffee
  - Smoking
  - the coffee x smoking interaction

Interpretations
- $\exp\{\beta_1\}$: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among non-smokers
- $\exp\{\beta_1 + \beta_3\}$: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among smokers

Interpretations
- $\exp\{\beta_2\}$: odds ratio of being a CHD case for smokers vs. non-smokers among non-coffee drinkers
- $\exp\{\beta_2 + \beta_3\}$: odds ratio of being a CHD case for smokers vs. non-smokers among coffee drinkers

Interpretations
- $\exp\{\beta_3\}$: factor by which the odds ratio of being a CHD case for coffee drinkers vs. non-drinkers is multiplied for smokers as compared to non-smokers
or
- $\exp\{\beta_3\}$: factor by which the odds ratio of being a CHD case for smokers vs. non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers

Interpretations
- $\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}$: fraction of cases among nonsmoking, non-coffee drinking individuals in the sample (determined by sampling plan)
- $\exp\{\beta_3\}$: ratio of odds ratios

Some Special Cases
Given $\log\left(\dfrac{\Pr(Y=1)}{\Pr(Y=0)}\right) = \beta_0 + \beta_1 COF + \beta_2 SMK + \beta_3 COF \cdot SMK$
- If $\beta_1 = \beta_2 = \beta_3 = 0$: neither smoking nor coffee drinking is associated with increased risk of CHD
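The interpretations of the interaction model can be verified numerically. The coefficient values below are hypothetical (the fitted values are not preserved in the transcript); the sketch only checks the algebra linking each odds ratio to the coefficients.

```python
import math

# Hypothetical coefficients for
# log odds = b0 + b1*COF + b2*SMK + b3*COF*SMK
b0, b1, b2, b3 = -1.0, 0.5, 1.0, -0.3


def odds(cof, smk):
    """Odds of being a CHD case for a given coffee/smoking pattern."""
    return math.exp(b0 + b1 * cof + b2 * smk + b3 * cof * smk)


or_coffee_nonsmokers = odds(1, 0) / odds(0, 0)   # exp(b1)
or_coffee_smokers = odds(1, 1) / odds(0, 1)      # exp(b1 + b3)
or_smoking_noncoffee = odds(0, 1) / odds(0, 0)   # exp(b2)
or_smoking_coffee = odds(1, 1) / odds(1, 0)      # exp(b2 + b3)

# exp(b3) is the ratio of odds ratios, whichever way it is read.
ratio_of_ors = or_coffee_smokers / or_coffee_nonsmokers

# Fraction of cases among non-smoking non-coffee drinkers in the sample.
baseline_fraction = odds(0, 0) / (1 + odds(0, 0))

print(or_coffee_nonsmokers, or_coffee_smokers, ratio_of_ors, baseline_fraction)
```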

Some Special Cases
Given $\log\left(\dfrac{\Pr(Y=1)}{\Pr(Y=0)}\right) = \beta_0 + \beta_1 COF + \beta_2 SMK + \beta_3 COF \cdot SMK$
- If $\beta_1 = \beta_3 = 0$: smoking, but not coffee drinking, is associated with increased risk of CHD

Some Special Cases
If $\beta_3 = 0$:
- Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-smoking is the same at all levels of coffee
- Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-coffee is the same at all levels of smoking

Confounding
- In epidemiological terms, Z is a confounder of the relationship of Y with X if Z is related to both X and Y and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a confounder of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X

Confounding
For example, consider the two models
- $Y = \beta_0 + \beta_1 X + \varepsilon_1$
- $Y = \gamma_0 + \gamma_1 X + \gamma_2 Z + \varepsilon_2$
then Z is a confounder of the X, Y relationship if $\gamma_1 \neq \beta_1$

Look at Confidence Intervals
Without Smoking
- OR = $e^{0.79}$ = 2.2
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: ($e^{0.13}$, $e^{1.44}$) = (1.14, 4.24)

Look at Confidence Intervals
With Smoking (adjusting for smoking)
- OR = $e^{0.53}$ = 1.7
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: ($e^{-0.17}$, $e^{1.22}$) = (0.85, 3.39)

Conclusion
- So, ignoring smoking, the CHD and coffee OR is 2.2 (95% CI: 1.14, 4.24)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association

Interaction Model
[Table: Model 3; columns: Variable, Est, se, z; rows: Intercept, Coffee, Smoking, Coffee*Smoking]
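The confidence-interval arithmetic for the coffee odds ratio can be reproduced with a short sketch. The estimates and standard errors (0.79, 0.33 and 0.53, 0.35) are the rounded values shown on the slides, so the results can differ from the slide's figures in the second decimal place; the function name is an assumption.

```python
import math


def or_with_ci(log_or, se, z=1.96):
    """Return (OR, lower, upper): build the CI on the log scale, then exponentiate."""
    lower = log_or - z * se
    upper = log_or + z * se
    return math.exp(log_or), math.exp(lower), math.exp(upper)


unadjusted = or_with_ci(0.79, 0.33)   # coffee, ignoring smoking
adjusted = or_with_ci(0.53, 0.35)     # coffee, adjusting for smoking

for label, (or_, lo, hi) in [("unadjusted", unadjusted), ("adjusted", adjusted)]:
    print(f"{label}: OR = {or_:.1f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

The unadjusted interval excludes 1 while the adjusted interval includes it, which is the basis for the slide's conclusion that the evidence for a coffee effect is more modest after adjustment.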

Testing Interaction Term
- Z = -0.59, p-value = [value not preserved]
- 95% Confidence interval for $\beta_1 + \beta_3$: (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!

Likelihood Ratio Test
- The Likelihood Ratio Test will help decide whether or not additional term(s) significantly improve the model fit
- The Likelihood Ratio Test (LRT) statistic for comparing nested models is -2 times the difference between the log likelihoods (LLs) for the Null vs. Extended models
- the statistic obtained is identical to the one from an analysis of variance test for linear regression models

LRT Example
- Deviance is a term used for the difference in -2*log likelihood relative to the best possible value from a perfectly predicting model.
- Change in deviance is the same as change in -2LL.
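A two-sided p-value for a Wald Z statistic such as the one reported for the interaction term can be computed from the standard normal distribution. This is a sketch under that assumption; the slide's own p-value is not preserved in the transcript, so the number printed here is a calculation, not a quotation.

```python
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a Z statistic under a standard normal reference."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_interaction = two_sided_p(-0.59)   # Z for the coffee*smoking term, from the slide
print(round(p_interaction, 2))
```

A p-value of this size is far above 0.05, consistent with the slide's reading that there is little evidence of effect modification.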

Model comparisons using likelihood ratio test

Summary: Unadjusted ORs
- The odds of CHD were estimated to be 3.4 times higher among smokers compared to non-smokers, 95% CI: (1.7, 7.9)
- The odds of CHD were estimated to be 2.2 times higher among coffee drinkers compared to non-coffee drinkers, 95% CI: (1.1, 4.3)

Summary: Adjusted ORs
- Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7 with 95% CI: (0.85, 3.4).
- Hence, the evidence in these data is insufficient to conclude coffee has an independent effect on CHD beyond that of smoking.

Comparing the models
- Models C and F are both nested in Model A
- Models C and F cannot be directly compared to one another, but we can see which has a smaller p-value when compared to Model A
  - C vs. A: $X^2$ = 26.5 with 2 df
  - F vs. A: $X^2$ = 21.7 with 3 df
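To see which comparison gives the smaller p-value, the chi-square tail probabilities can be computed for the two likelihood ratio statistics on the slide. The sketch below avoids external libraries by using a standard recurrence for the upper tail of the chi-square distribution; the helper name `chi2_sf` is chosen here for illustration, and a statistics package would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1.

    Starts from the closed forms for df = 1 (erfc) or df = 2 (exp) and
    adds one term for each additional 2 degrees of freedom.
    """
    h = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


# LRT statistics and degrees of freedom from the slide.
p_c_vs_a = chi2_sf(26.5, 2)
p_f_vs_a = chi2_sf(21.7, 3)
print(p_c_vs_a, p_f_vs_a)
```

Both p-values are very small, and the one for C vs. A is the smaller of the two.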

What next?
- Model C improves prediction beyond gender alone (Model A) more than Model F.
- Model C should be the next parent model, and we should test the new variables in Model F to see if they continue to improve prediction within the context of Model C.
- When a tentative final model is identified, the assumptions of logistic regression should be checked.

Flexibility in linear models
- A spline allows the slope for a continuous predictor to change at a given point; the coefficient is for the difference in log odds ratio
- An interaction term allows the odds ratio for one variable to differ by the value of a second variable; the coefficient is for the difference in log odds ratio

Poisson regression model
- Log-linear model for mean rate: [equation not preserved], where p is the number of predictors in the model
- Random component: [equation not preserved]
- Here: [equation not preserved]

Exponentiating Poisson regression models

Interpreting Poisson regression parameters

Modelling rates
- Of key interest in Poisson regression models is to make inference about rates of events
- We are often interested in whether the rate of cancer, or some other disease, varies by population subgroups such as gender, race, or age

Person-years
- In defining rates, it is crucial to state what denominator we have in mind
- For disease, we are usually interested in disease rate per person, per year
- If the HIV incidence rate is 5 per 1 million person-years, that means we expect to see 5 new cases of HIV per 1 million persons per year

Modelling Danish Cancer cases with an offset
- We observed Danish cancer cases in 6 age groups over a period of 4 years
- The model: [equation not preserved] predicts log rates per 10,000 person-years

Interpretation of coefficients

More about offsets
- The purpose of an offset is to specify the denominator of the predicted rates
- We should always try to use an offset if we suspect the underlying population sizes vary for the observed counts
- Typically, we'll use log(N) as the offset, where N is the sample size or number of person-years generating each count

Poisson regression for cohort studies
- Log-linear regression can be used to estimate relative risks for cohort studies (but not case-control)
- Relative risk is like relative rate, but we are comparing risks (probability of disease) instead of rates (expected cases per person-year) across groups
- Could also estimate relative risk by transforming results from logistic regression

Grand summary
- Exploratory analysis includes graphs and tables; good to get a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide us with a framework in which to perform confirmatory analysis in many settings
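The role of the offset can be illustrated with a small sketch. It assumes the usual log-linear form log(expected count) = log(N) + log(rate), in which log(N) enters with its coefficient fixed at 1; the function name, the population size, and the rate-ratio coefficient are hypothetical. The rate of 5 per 1 million person-years is the example from the person-years slide.

```python
import math


def expected_count(person_years, log_rate):
    """Expected count under log(E[count]) = log(N) + log(rate).

    log(N) is the offset: it sets the denominator of the rate and is not
    estimated, unlike the regression coefficients inside log_rate.
    """
    return math.exp(math.log(person_years) + log_rate)


# Rate from the slide: 5 new cases per 1 million person-years.
log_rate = math.log(5 / 1_000_000)

# Hypothetical population contributing 2 million person-years.
cases = expected_count(2_000_000, log_rate)

# Hypothetical coefficient b1 for a binary subgroup indicator:
# exp(b1) is the relative rate for that subgroup.
b1 = math.log(2.0)
cases_subgroup = expected_count(2_000_000, log_rate + b1)

print(cases, cases_subgroup, cases_subgroup / cases)
```

Because the offset is the same in both calls, the ratio of expected counts equals the relative rate exp(b1), regardless of the number of person-years.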

Grand summary: linear models
- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes
- Poisson regression: for counts

Grand summary: testing
- We can test significance of a single predictor using a z-test (or t-test for linear regression)
- Test significance of several covariates using a pair of nested models by a likelihood ratio test
- Know how to interpret p-values and confidence intervals!

Grand summary: modelling
In all generalized linear models, we can use the following tools to make models more flexible:
- Adjust for confounders using additive covariates
- Allow effect modification through interaction terms
- Fit curved and bent lines through polynomials and splines