Basic Statistics and Data Analysis for Health Researchers from Foreign Countries



Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1

Content Quantifying association between continuous variables. In particular: correlation and (simple) regression. Dias 2

Example Newly diagnosed Type 2 Diabetes A data set with 729 newly diagnosed Type 2 diabetes patients.

pt:      Patient ID
glucose: Diagnostic plasma glucose (mmol/l)
bmi:     Body Mass Index (kg/m2)
sex:     sex (1=male, 0=female)
age:     age (years)

The first rows of the data set:

   pt glucose      bmi sex      age
1   1    15.3 25.16070   0 53.02669
2   2    12.1 22.96838   0 50.86653
3   4    13.4 34.37500   0 87.73990
4   5    14.0 26.16190   1 64.59411
5   6    13.8 35.07805   0 62.10815
6   7    13.8 26.71779   1 58.97604
7   8    16.2 27.18233   1 82.46133
8   9     8.5 33.70120   0 76.36687
9  10    17.3 28.67547   1 72.63792
10 11     8.6 26.21882   1 48.91170
11 12    17.0 27.43951   0 53.40999
12 13    15.4 32.67832   0 64.07392
13 14     7.8 24.05693   1 63.86858
14 15    16.4 25.12406   1 52.35318
15 16     7.4 33.13134   0 42.77618
16 17    11.6 30.12729   1 46.76797
17 19    14.2 33.07857   0 63.45517
18 20    14.4 29.24211   0 78.74333
19 21    11.6 21.24225   1 66.66940

Dias 3
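
The diabetes data set itself is not distributed with these slides. For readers who want to run the R code that follows, here is a minimal sketch of a simulated stand-in data frame; the variable names and rough ranges match the listing above, but all values (and the weak negative BMI-glucose relation) are artificial assumptions:

# Simulated stand-in for the course data (purely artificial values;
# only the variable names and rough ranges match the listing above)
set.seed(1)
N <- 729
diabetes <- data.frame(
  pt  = 1:N,
  bmi = rnorm(N, mean = 28, sd = 4),       # kg/m2
  sex = rbinom(N, size = 1, prob = 0.5),   # 1 = male, 0 = female
  age = rnorm(N, mean = 62, sd = 12)       # years
)
# weak negative relation to BMI, as the later slides suggest
diabetes$glucose <- pmax(3, 16.5 - 0.06 * diabetes$bmi + rnorm(N, sd = 4))
diabetes$glucose[sample(N, 4)] <- NA       # mimic the few missing values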

Research question Do fat people have a more severe diabetes when the diabetes is discovered? Or in a more statistical language: Is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis? Dias 4

Scatter-plot When investigating a potential association between only two variables (like diagnostic plasma glucose and BMI), a scatter-plot is an important part of the analysis. It gives insight into the nature of the association. It shows problems in the data, e.g. outliers and strange or impossible values. Dias 5

Scatter-plot Dias 6

Scatter-plot There is no apparent tendency, and specifically not one that would support our research question; if we had to point out a tendency, it would be that high BMI is associated with lower diagnostic glucose (why is this not so strange if we think about how diabetes is diagnosed?). There seem to be some very large values, especially for diagnostic plasma glucose; these are valid measurements. Maybe a log transformation of glucose would make associations more apparent? Dias 7

Scatter-plot R code

plot(diabetes$bmi, diabetes$glucose, frame = TRUE, main = NULL,
     xlab = "BMI (kg/m2)", ylab = "Glucose (mmol/l)", col = "green", pch = 19)

Dias 8

Scatter-plot log transformation Dias 9

Measures of association We want to capture the association between two variables in a single number: a correlation coefficient, a measure of association. Suppose that Y i is the diagnostic plasma glucose of patient i and X i the BMI for the same person. Then we want our measure of association to have the following characteristics: A positive association indicates that if X i is large (relative to the rest of the sample) then Y i is likely to be large as well. A negative association indicates that if X i is large then Y i is likely to be small. Dias 10

Measures of association are between -1 and 1:
 0: no association
 1: perfect positive association
-1: perfect negative association
Dias 11
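
In R, all three coefficients used on the following slides come from the same function, cor(); a minimal sketch on the diabetes data (use = "complete.obs" drops patients with missing values):

cor(diabetes$bmi, diabetes$glucose, use = "complete.obs")                       # Pearson's r
cor(diabetes$bmi, diabetes$glucose, use = "complete.obs", method = "spearman")  # Spearman's rho
cor(diabetes$bmi, diabetes$glucose, use = "complete.obs", method = "kendall")   # Kendall's tau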

Measures of association for the diabetes data r = -0.059 ρ = -0.050 τ = -0.034 Dias 12

Measures of association for the diabetes data and log transformed r = -0.053 ρ = -0.050 τ = -0.034 Only the first one changes! Dias 13

Pearson's correlation coefficient Pearson's correlation coefficient is computed from the data set (X_i, Y_i), i = 1, ..., N as:

r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{(N-1)\, SD_x \, SD_y}

where \bar{X} and \bar{Y} are the respective means and SD_x and SD_y the respective standard deviations. Dias 14
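
As a sanity check, the formula can be evaluated directly and compared with R's built-in cor(); complete.cases() mimics use = "complete.obs":

ok <- complete.cases(diabetes$bmi, diabetes$glucose)
x <- diabetes$bmi[ok]
y <- diabetes$glucose[ok]
# the formula above; sd() uses the same (N - 1) denominator
r_hand <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
c(r_hand, cor(x, y))   # identical up to rounding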

Characteristics of Pearson's correlation coefficient Pearson's correlation coefficient has the following properties: It measures the degree of linear association. It is invariant to a linear change of scale of the variables. It is not robust to outliers. Coefficient values that are comparable between different data sets, and moreover a valid confidence interval and p-value, require that both X_i and Y_i are Normally distributed. Dias 15
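
Two of these properties are easy to illustrate on small artificial data (the numbers below are arbitrary):

set.seed(2)
x <- rnorm(100)
y <- x + rnorm(100)
cor(x, y)
cor(10 * x + 3, y)         # unchanged: invariant to a linear change of scale
cor(c(x, 10), c(y, -10))   # a single outlier changes r noticeably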

Pearson's correlation coefficient R code

> cor(diabetes$bmi, diabetes$glucose, use = "complete.obs")
[1] -0.05938123

Gives only the correlation coefficient.

> cor.test(diabetes$bmi, diabetes$glucose)

        Pearson's product-moment correlation

data: diabetes$bmi and diabetes$glucose
t = -1.5995, df = 723, p-value = 0.1101
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.13162533  0.01349032
sample estimates:
        cor
-0.05938123

Also performs a statistical test of whether the coefficient differs from zero. Dias 16

Normally distributed? [Figure: distributions of BMI and Glucose, with a Normal distribution for comparison.] Dias 17

Normally distributed? [Figure: distributions of BMI and log(Glucose).] Dias 18

Normally distributed? Dias 19

Normally distributed? Dias 20

R code

A histogram of BMI:
hist(diabetes$bmi, main = "BMI", xlab = "BMI (kg/m2)", col = "green")

A Normal Q-Q plot of BMI:
qqnorm(diabetes$bmi, main = "BMI", col = "green")
qqline(diabetes$bmi, col = "red")

And how do we get all these works of art in some decent format?
jpeg(file = "D:/mydirectory/mypicture.jpg", width = 500, height = 500)
# put here the code that generates the picture
dev.off()

(Note: in R, Windows paths need forward slashes or doubled backslashes.) Dias 21

Rank correlation Spearman's ρ If the data do not appear to be Normally distributed, or when there are outliers, one may instead compute the correlation between the ranks of the X_i values and the ranks of the Y_i values. This gives a nonparametric correlation coefficient called Spearman's ρ. It measures monotone association. It is invariant to monotone transformations (like a log transformation). It is robust to outliers. It has an odd interpretation. Dias 22
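
A minimal sketch of this idea on artificial data: Spearman's ρ is simply Pearson's r computed on the ranks, and it does not change under a monotone transformation:

set.seed(3)
x <- rexp(50)
y <- x^2 + rnorm(50, sd = 0.5)
cor(rank(x), rank(y))                # Pearson's r on the ranks
cor(x, y, method = "spearman")       # identical in the absence of ties
cor(log(x), y, method = "spearman")  # unchanged: ranks survive a log transform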

Spearman's rank correlation coefficient R code

> cor.test(diabetes$bmi, diabetes$glucose, method = "spearman")

        Spearman's rank correlation rho

data: diabetes$bmi and diabetes$glucose
S = 66678220, p-value = 0.1801
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho
-0.04983743

Warning message:
In cor.test.default(diabetes$bmi, diabetes$glucose, method = "spearman") :
  Cannot compute exact p-values with ties

Dias 23

Rank correlation Kendall's τ A measure of monotone association with a more intuitive interpretation than Spearman's ρ is Kendall's τ. The observations from a pair of subjects i, j are
concordant if X_i < X_j and Y_i < Y_j, or X_i > X_j and Y_i > Y_j, and
discordant if X_i < X_j and Y_i > Y_j, or X_i > X_j and Y_i < Y_j.
Kendall's τ is the difference between the probability of a concordant pair and the probability of a discordant pair. There are various versions of Kendall's τ depending on how ties are treated. Dias 24
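
This definition can be turned into code directly; a sketch on artificial data without ties, counting concordant and discordant pairs over all choose(n, 2) pairs:

set.seed(4)
x <- rnorm(40)
y <- x + rnorm(40)
n <- length(x)
conc <- disc <- 0
for (i in 1:(n - 1)) for (j in (i + 1):n) {
  s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
  if (s > 0) conc <- conc + 1 else if (s < 0) disc <- disc + 1
}
(conc - disc) / choose(n, 2)    # tau from the definition
cor(x, y, method = "kendall")   # the built-in version agrees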

Characteristics of Kendall's tau It measures monotone association. It is invariant to monotone transformations (like a log transformation). It is robust to outliers. It has a more straightforward interpretation than Spearman's rho. Dias 25

Kendall's rank correlation coefficient R code

> cor.test(diabetes$bmi, diabetes$glucose, method = "kendall")

        Kendall's rank correlation tau

data: diabetes$bmi and diabetes$glucose
z = -1.3755, p-value = 0.169
alternative hypothesis: true tau is not equal to 0
sample estimates:
        tau
-0.03427314

Dias 26

Correlation in the diabetes data r = -0.059 (p = 0.110) ρ = -0.050 (p = 0.180) τ = -0.034 (p = 0.169) Dias 27

Correlation in the diabetes data and log transformed r = -0.053 (p = 0.154) ρ = -0.050 (p = 0.180) τ = -0.034 (p = 0.169) Dias 28

Limitations of correlation coefficients While it is (relatively) clear what a correlation coefficient of 0 means, and also 1 or -1, it is often unclear what a highly significant correlation of, say, 0.5 means. Correlation rarely answers the research question to a sufficient extent, because it is not easily interpretable. Moreover, coefficients of correlation depend on the sample selection, and therefore we cannot compare values of the coefficients found in different data sets. Dias 29

Department of Biostatistics Dias 30

Regression analysis An (intuitively interpretable) way to describe a (linear) association between two continuous variables. It models a response Y (the dependent variable, the endogenous variable, the output) as a function of a predictor X (the independent variable, the exogenous variable, the explanatory variable, the covariate) and a term representing random other influences (error, noise). Dias 31

Regression model formulation We say: to regress Y on X, or: to regress glucose on BMI. Mathematically:

Y_i = α + β X_i + ε_i

where the ε_i are independent, Normally distributed noise terms with mean 0 and standard deviation σ. Dias 32
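
One way to get a feel for the model is to simulate from it and see lm() recover the parameters; α = 2, β = 0.5 and σ = 1 below are arbitrary choices:

set.seed(5)
X <- runif(200, min = 20, max = 40)              # a BMI-like predictor
Y <- 2 + 0.5 * X + rnorm(200, mean = 0, sd = 1)  # the model above
coef(lm(Y ~ X))   # estimates close to the true alpha = 2 and beta = 0.5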

Regression model The mean of Y is modelled as a linear function of X: a line in the X-Y plane. For each X, Y is a random variable, Normally distributed around the modelled mean of Y with standard deviation σ. Dias 33

Scatter-plot with regression line Dias 34

Interpretation of the parameters We have variation due to a systematic part, the explanatory variable, and a random part, the noise. The systematic part of the model is defined by the regression line. α = the intercept: the mean level of Y_i when X_i = 0. β = the slope: the mean increase in Y_i when X_i is increased by 1 unit. Dias 35

Research question Do fat people have a more severe diabetes when the diabetes is discovered? Or in a more statistical language: Is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis? In a (simple) linear regression analysis, is the slope β different from 0 (or more pertinently, larger than 0)? Dias 36

How does the model answer the research question? Interest may focus on testing simple hypotheses about the two parameters: Null hypothesis: β = 0. Null hypothesis: α = 0. The second hypothesis often has no (clinical) meaning. Dias 37

Linear regression R code

> mymodel <- lm(diabetes$glucose~diabetes$bmi)
> summary(mymodel)

Call:
lm(formula = diabetes$glucose ~ diabetes$bmi)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6974 -3.5771 -0.8535  2.3008 49.1636

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   14.96096    1.08396   13.80   <2e-16 ***
diabetes$bmi  -0.05739    0.03588   -1.60    0.110
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.976 on 723 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.003526, Adjusted R-squared: 0.002148
F-statistic: 2.558 on 1 and 723 DF, p-value: 0.1101

The table with parameter estimates gives the estimate of the slope (the diabetes$bmi row) and the p-value of the test for the null hypothesis β = 0. Dias 38
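
The individual pieces of this summary can also be extracted programmatically; a small sketch, assuming mymodel has been fitted as above (confint() gives the confidence interval that goes with the slide's t-test of the slope):

summary(mymodel)$coefficients["diabetes$bmi", ]  # estimate, SE, t value, p-value
confint(mymodel)                                 # 95% CIs for intercept and slope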

Plot of regression line R code The output of lm() can be used to add the regression line to the scatter-plot:

> plot(diabetes$bmi, diabetes$glucose)
> mymodel <- lm(diabetes$glucose~diabetes$bmi)
> abline(mymodel)

Dias 39

Scatter-plot with regression line log transformed glucose

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.627159   0.069354  37.881   <2e-16 ***
diabetes$bmi -0.003277   0.002296  -1.428    0.154

Dias 40

How are the parameters estimated? The estimated parameters of the linear model define the line (found among all possible lines) which minimizes the sum of squared vertical distances between the data points and the line in the scatter-plot. The estimation method is called ordinary least squares (maximum likelihood gives the same answer under the Normal error assumption). Dias 41
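
For simple linear regression the least-squares solution has a closed form; a sketch on artificial data, compared with lm():

set.seed(6)
x <- rnorm(100, mean = 28, sd = 4)
y <- 15 - 0.06 * x + rnorm(100, sd = 5)
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)
c(alpha_hat, beta_hat)
coef(lm(y ~ x))   # identical up to rounding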

Least squares fit Dias 42

Does the model fit the data? Dias 43

Diagnostic plots Dias 44

Diagnostic plots R produces some diagnostic plots (of varying usefulness). The residuals (the error or noise) were assumed to be Normally distributed; this can be studied in the Q-Q plot (top right). More importantly, the residuals should have a constant standard deviation, i.e. the variance should not increase with, for example, BMI; this can be studied in the residuals vs. fitted plot (top left).

> mymodel <- lm(diabetes$glucose~diabetes$bmi)
> opar <- par(mfrow = c(2,2), oma = c(0,0,1.1,0))
> plot(mymodel)
> par(opar)

Dias 45

Data transformations If the residuals are not Normal, or, more seriously (since the central limit theorem deals with much of the non-Normality issue), if the variance seems to increase with the level, it may be a good idea to transform one or both variables. This is the real reason to investigate log(glucose) instead of glucose. Dias 46
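
One consequence of a log transformation is that the slope then lives on the log scale; exponentiating it gives an (approximately) multiplicative effect. A sketch using the formula interface, which is equivalent to the diabetes$... style on the earlier slides; with the slide-40 estimate of -0.0033, exp(-0.0033) ≈ 0.9967, i.e. about 0.33% lower glucose per extra BMI unit:

logmodel <- lm(log(glucose) ~ bmi, data = diabetes)
exp(coef(logmodel)["bmi"])   # multiplicative change in glucose per kg/m2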

Data transformations log transform Dias 47

The influence of one outlier Dias 48

Simpson's paradox Florida death penalty verdicts for homicide 1976-1987, by defendant's race (counts: death sentences / other verdicts):

White: 11% (53/430)
Black:  8% (15/176)

Dias 49

Simpson's paradox The same verdicts stratified by the victim's race (counts: death sentences / other verdicts):

              White defendant   Black defendant
Victim white  11% (53/414)      23% (11/37)
Victim black   0% (0/16)         3% (4/139)

Blacks tend to murder blacks and whites tend to murder whites, and the murder of a white person has a higher probability of leading to a death sentence. For either victim's race, the probability for a black defendant to get the death penalty is about 2 times higher. Dias 50
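
The whole paradox can be reproduced from these eight counts in a few lines of R; a sketch (counts entered as death sentences vs. other verdicts, matching the tables above):

tab <- array(c(53, 11, 414, 37,   # victim white: death penalty, then not
               0, 4, 16, 139),    # victim black: death penalty, then not
             dim = c(2, 2, 2),
             dimnames = list(defendant = c("white", "black"),
                             penalty   = c("yes", "no"),
                             victim    = c("white", "black")))
prop.table(apply(tab, c(1, 2), sum), margin = 1)  # marginal: 11% vs 8%
prop.table(tab[, , "white"], margin = 1)          # victim white: 11% vs 23%
prop.table(tab[, , "black"], margin = 1)          # victim black: 0% vs 3%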

Confounding [Diagram: the victim's race points to both the defendant's race and the death penalty verdict; the green arrow from defendant's race to death penalty is the association of interest.] We are interested in the green highlighted association, but the victim's race is correlated both with the defendant's race and with the outcome of the trial. Dias 51

Confounding A confounder influences both exposure and outcome. [Diagram: the confounder points to both exposure and outcome; the green arrow from exposure to outcome is the association of interest.] When confounding is present we cannot interpret the green highlighted association as causal. Dias 52

Randomization [Diagram: exposure is randomised, so the confounder no longer influences it; only the arrows from the confounder to the outcome and from exposure to outcome remain.] Often there are many factors that may influence both exposure and outcome; some of them may not be observed or are unknown. If exposure is randomised, then there is no confounding, and the green highlighted association can be interpreted as causal. Dias 53

Two regressions The blue points denote patients with SBP > 140 mmHg; the blue line is the corresponding regression line. The red points denote patients with SBP < 140 mmHg; the red line is the corresponding regression line. The black line is the general regression line. The slopes from the stratified analyses are less steep than the slope of the general line. Dias 54

Multiple regression

> mymodel <- lm(log(diabetes$glucose)~diabetes$bmi+diabetes$sbp)
> summary(mymodel)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.639870   0.069389  38.045   <2e-16 ***
diabetes$bmi -0.002625   0.002308  -1.137   0.2558
diabetes$sbp -0.054447   0.024168  -2.253   0.0246 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The adjusted slope (association) of bmi is less pronounced than before. SBP is related to both glucose and bmi and is a confounder. Dias 55

Multiple regression Adjusting a statistical analysis means including other predictor variables in the model formula. Intuitively, a slope for BMI is determined for each level of the SBP variable separately and these are then averaged; including SBP in the analysis removes the confounding effect of SBP from the relationship between log(glucose) and BMI. Dias 56
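
This intuition can be checked on simulated data in which a binary stand-in for SBP confounds the BMI-glucose relation (all numbers below are artificial): the within-stratum slopes sit close to the adjusted slope, and both differ from the crude one:

set.seed(7)
sbp <- rbinom(500, size = 1, prob = 0.5)         # binary stand-in for high SBP
bmi <- 26 + 3 * sbp + rnorm(500)                 # SBP shifts BMI ...
glu <- 14 - 1.5 * sbp + 0.05 * bmi + rnorm(500)  # ... and glucose: a confounder
coef(lm(glu ~ bmi))["bmi"]                       # crude slope: confounded
coef(lm(glu ~ bmi, subset = sbp == 0))["bmi"]    # slope within stratum sbp = 0
coef(lm(glu ~ bmi, subset = sbp == 1))["bmi"]    # slope within stratum sbp = 1
coef(lm(glu ~ bmi + sbp))["bmi"]                 # adjusted slope, near the true 0.05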

Take home message Association between two continuous variables may be measured by correlation coefficients or in a (simple) linear regression analysis. The latter arguably provides the most interpretable results. Moreover, it is straightforwardly extended to deal with confounding, and more. Dias 57