Correlation and simple linear regression S6


1 Basic Medical Statistics Course: Correlation and simple linear regression (S6). Patrycja Gradowska, December 3.

2 Introduction. So far we have looked at the association between: two categorical variables (chi-square test); a numerical variable and a categorical variable (independent-samples t-test and ANOVA). We will now look at the association between two numerical (continuous) variables, say x and y.

3 Introduction. Example 1: Mortality from malignant melanoma of the skin versus latitude of residency among white males in the United States (van Belle et al., 2004). [Data table: one row per state (Alabama, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, ..., Wisconsin, Wyoming) giving its latitude (degrees North) and its mortality rate (deaths per 10 million); the numeric values are not reproduced here.] How do we investigate the association between these two variables?

4 Scatter plot. There is a roughly linear association.

5 Relationship between two numerical variables. If a linear relationship between x and y appears to be reasonable from the scatter plot, we can take the next step and: 1. Calculate Pearson's product moment correlation coefficient between x and y, which measures how closely the data points on the scatter plot resemble a straight line. 2. Perform a simple linear regression analysis, which finds the equation of the line that best describes the relationship between the variables seen in the scatter plot.

6 Correlation. The sample Pearson's product moment correlation coefficient, or correlation coefficient, between variables x and y is calculated as
$$ r(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i}, $$
where {(x_i, y_i) : i = 1, ..., n} is a random sample of n observations on x and y, x̄ and ȳ are the sample means of x and y, s_x and s_y are the corresponding sample standard deviations, and z_{x_i} and z_{y_i} are the z-scores of x and y for the i-th observation.
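As an editorial illustration (not part of the original slides), a minimal Python sketch of this calculation, assuming numpy and scipy are available; the latitude and mortality values are invented purely to make the snippet runnable, and the hand-computed r is cross-checked against scipy.stats.pearsonr.

```python
import numpy as np
from scipy import stats

# Hypothetical latitude (x, degrees North) and mortality (y, deaths per 10 million) values
x = np.array([33.0, 34.5, 35.0, 37.5, 39.0, 41.5, 43.0, 44.5])
y = np.array([219.0, 170.0, 182.0, 160.0, 149.0, 128.0, 116.0, 110.0])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)    # z-scores of x (sample SD, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)    # z-scores of y
r_manual = np.sum(zx * zy) / (n - 1)   # r = 1/(n-1) * sum of z_xi * z_yi

r_scipy, p_value = stats.pearsonr(x, y)
print(round(r_manual, 3), round(r_scipy, 3))   # both routes give the same r
```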

7 Correlation. Properties of r: r estimates the true population correlation coefficient (ρ). r takes on any value between −1 and 1, i.e. −1 ≤ r ≤ 1. The magnitude of r indicates the strength of the linear relationship between x and y: r = 1 or −1 means a perfect linear association; r = 0 indicates no linear association (but there can be, e.g., a non-linear one); the closer r is to −1 or 1, the stronger the linear association (e.g. r = −0.1, weak association, vs r = 0.85, strong association). The sign of r indicates the direction of the association: r > 0 implies a positive relationship, i.e. the two variables tend to move in the same direction; r < 0 implies a negative relationship, i.e. the two variables tend to move in opposite directions.

8 Correlation. Properties of r (cont.): r(ax + b, cy + d) = r(x, y), where a > 0, c > 0, and b and d are constants. r(x, y) = r(y, x). r ≠ 0 does not imply causation! Just because two variables are correlated does not necessarily mean that one causes the other. r² is called the coefficient of determination: r² is a number between 0 and 1 and represents the proportion of total variation in one variable that is explained by the other. For example, a coefficient of determination between body weight and age of 0.60 means that 60% of the total variation in body weight is explained by age alone and the remaining 40% is explained by other factors.

9 Correlation. [Figure: example scatter plots illustrating r = −1, r = 1, r = 0.8, r = −0.8, r = 0 (two panels, one with a non-linear pattern), 0 < r < 1, and −1 < r < 0.] Don't interpret r without looking at the scatter plot!

10 Correlation. Hypothesis test for the population correlation coefficient ρ: H_0: ρ = 0 versus H_1: ρ ≠ 0. Under H_0, the test statistic
$$ T = r \sqrt{\frac{n-2}{1-r^2}} $$
follows a Student-t distribution with n − 2 degrees of freedom. Note: this test assumes that the variables are normally distributed.
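The test statistic translates directly into code. The following sketch (an editorial addition, assuming numpy and scipy) computes T and its two-sided p-value from the t distribution with n − 2 degrees of freedom; scipy.stats.pearsonr already reports the same p-value, so the function mainly makes the formula explicit.

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """Test H0: rho = 0 with T = r * sqrt((n - 2) / (1 - r^2)), t distribution, n - 2 df."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r, _ = stats.pearsonr(x, y)
    t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
    return r, t_stat, p_value
```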

11 Correlation. Example 1 revisited: skin cancer mortality vs latitude. [Scatter plot of mortality (deaths per 10 million, vertical axis) against latitude (degrees North, horizontal axis).] What are the magnitude and sign of the correlation coefficient between latitude and skin cancer mortality?

12 Correlation. Example 1 revisited: skin cancer mortality vs latitude, SPSS output. The Correlations table reports a Pearson correlation between Mortality and Latitude of r = −0.825, with two-tailed significance (p-value) shown as .000 and flagged as significant at the 0.01 level; N is the number of observations (states).

13 Pearson's product moment correlation coefficient measures the strength and direction of the linear association between x and y. But often we are also interested in predicting the value of one variable given the value of the other. This requires finding an equation (or mathematical model) that describes or summarizes the relationship between the variables. If a scatter plot of our data shows an approximately linear relationship between x and y, we can use simple linear regression to estimate the equation of this line. Regression, unlike correlation, requires that we have: a dependent variable (or outcome or response variable), i.e. the variable being predicted (always on the vertical or y-axis); and an independent variable (or explanatory or predictor variable), i.e. the variable used for prediction (always on the horizontal or x-axis). Let's assume that x and y are the independent variable and the dependent variable, respectively.

14 Simple linear regression postulates that in the population y = α + βx + ε, where: y is the dependent variable; x is the independent variable; α and β are parameters called population regression coefficients; and ε is a random error term.

15 [Figure: scatter plot of the observed data points, y versus x.]

16 [Figure: the same scatter plot with the point E(y|x_i) marked.] E(y|x_i) is the mean value of y when x = x_i.

17 [Figure: the scatter plot with the line through the means E(y|x_i) drawn.] E(y|x) = α + βx is the population regression function.

18 [Figure: the population regression line E(y|x) = α + βx, with the intercept α and slope β indicated.] α is the y-intercept of the population regression function, i.e. the mean value of y when x equals 0. β is the slope of the population regression function, i.e. the mean (or expected) change in y associated with a 1-unit increase in the value of x; cβ is the mean change in y for a c-unit increase in the value of x. α and β are estimated from the sample data, usually using the method of least squares.

19 [Figure: fitted line ŷ = a + bx, with a residual e_i = y_i − ŷ_i indicated at x_i.] The least squares method chooses a and b (the estimates for α and β) to minimize the sum of the squares of the residuals:
$$ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2 . $$

20 The least squares estimates of α and β are:
$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}, $$
where x̄ and ȳ are the respective sample means of x and y. Note that b = r(x, y) · s_y / s_x, where r(x, y) is the sample product moment correlation between x and y, and s_x and s_y are the sample standard deviations of x and y.
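As an editorial sketch (not from the slides, assuming numpy and scipy), both the least squares formulas and the equivalent slope-from-correlation relation b = r·s_y/s_x can be coded directly; scipy.stats.linregress returns the same slope and intercept and serves as a cross-check.

```python
import numpy as np
from scipy import stats

def least_squares_fit(x, y):
    """Return (a, b): intercept and slope minimising the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

def slope_from_r(x, y):
    """Equivalent slope via b = r * s_y / s_x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r, _ = stats.pearsonr(x, y)
    return r * y.std(ddof=1) / x.std(ddof=1)
```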

21 Relationship between the slope b and the correlation coefficient r: r ≠ b unless s_x = s_y; r measures the strength of the linear association between x and y, while b measures the size of the change in the mean value of y due to a unit change in x; r does not distinguish between x and y, while b does; r is scale-free, while b is not. But: r and b have the same sign; neither r nor b implies causation; both r and b can be affected by outliers; and r = 0 if and only if b = 0, so the test of β = 0 is equivalent to the test of ρ = 0 (i.e. no linear relationship).

22 Test of H_0: β = 0 versus H_1: β ≠ 0.
1. t-test: test statistic T = b / SE(b), where SE(b) is the standard error of b calculated from the data. Under H_0, T follows a Student-t distribution with n − 2 degrees of freedom.
2. F-test: test statistic F = (b / SE(b))² = T², where SE(b) and T are as above. Under H_0, F follows an F distribution with 1 and n − 2 degrees of freedom.
The t-test and the F-test lead to the same outcome. Note: the test of zero intercept α is of less interest, unless x = 0 is meaningful.
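For readers working outside SPSS, a hedged sketch (assuming numpy and statsmodels; the weight and blood pressure numbers below are invented) reproduces the slope t-test and F-test and shows that F = T² for a single predictor.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical body weight (kg) and blood pressure (mmHg) values, for illustration only
weight = np.array([85.0, 94.0, 78.0, 90.0, 105.0, 72.0, 99.0, 88.0])
bp = np.array([120.0, 126.0, 112.0, 121.0, 135.0, 108.0, 130.0, 122.0])

X = sm.add_constant(weight)       # design matrix with an intercept column
fit = sm.OLS(bp, X).fit()

b, se_b = fit.params[1], fit.bse[1]
t_stat = b / se_b                 # T = b / SE(b), t distribution with n - 2 df
print(t_stat, fit.pvalues[1])     # slope t-test and its p-value
print(fit.fvalue, fit.f_pvalue)   # F-test; fit.fvalue equals t_stat ** 2 here
```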

23 Example 2: blood pressure (mmHg) versus body weight (kg) in 20 patients with hypertension (Daniel & Cross, 2013). [Data table with columns BP and Weight; the 20 observations are not reproduced here.]

24 SPSS output. The Coefficients table (dependent variable BP, predictor Weight) gives the unstandardized coefficients B with their standard errors, the standardized coefficient Beta, and the t and Sig. values for the constant and for Weight; from these, the regression equation is BP = a + b·Weight (the numeric coefficients were not fully recovered in this transcription). The ANOVA table partitions the total sum of squares (about 560) into a regression sum of squares (about 505, 1 df) and a residual sum of squares (about 54, mean square 3.029); the F-test is significant (Sig. = .000). Dependent variable: BP; predictors: (Constant), Weight.

25 Standardized coefficients. Obtained by standardizing both y and x (i.e. converting them into z-scores) and re-running the regression. After standardization, the intercept is equal to zero and the slope for x is equal to the sample correlation coefficient. These are of greater concern in multiple linear regression (next lecture), where the predictors are expressed in different units. Standardization removes the dependence of the regression coefficients on the units of measurement of y and the x's, so they can be meaningfully compared. The larger the standardized coefficient (in absolute value), the greater the contribution of the respective variable to the prediction of y. Standardized and unstandardized coefficients have the same sign and their significance tests are equivalent.
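A small editorial sketch (assuming numpy and statsmodels) that standardizes both variables and refits, so the resulting slope can be compared with the sample correlation coefficient:

```python
import numpy as np
import statsmodels.api as sm

def standardized_coefficients(x, y):
    """Refit the regression on z-scores; the intercept is ~0 and the slope equals r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    fit = sm.OLS(zy, sm.add_constant(zx)).fit()
    return fit.params   # [intercept (close to 0), standardized slope (= sample r)]
```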

26 Simple linear regression is only appropriate when the following assumptions are satisfied: 1. Independence: the observations are independent, i.e. there is only one pair of observations per subject. 2. Linearity: the relationship between x and y is linear. 3. Constant variance: the variance of y is constant for all values of x. 4. Normality: y has a Normal distribution.

27 Simple linear regression. Checking the linearity assumption: 1. Make a scatter plot of y versus x. If the assumption of linearity is met, the points in this plot should generally form a straight line. 2. Plot the residuals against the explanatory variable x. If the assumption of linearity is met, we should see a random scatter of points around zero rather than any systematic pattern. [Figure: two residual plots, one labelled "Linearity" (random scatter around 0) and one labelled "Lack of linearity" (systematic pattern).]

28 Simple linear regression. Checking the constant variance assumption: make a residual plot, i.e. plot the residuals against the fitted values of y (ŷ_i = a + b x_i). If the assumption is met, we expect to observe a random scatter of points. If the scatter of the residuals increases or decreases as ŷ increases, then this assumption is not satisfied. [Figure: two residual plots, one labelled "Constant variance" (random scatter) and one labelled "Non-constant variance" (spread changing with ŷ).]
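The two diagnostic plots described on this and the previous slide take only a few lines of code. The sketch below (an editorial addition, assuming numpy, matplotlib, and statsmodels) plots residuals against x for the linearity check and against the fitted values for the constant-variance check.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

def residual_plots(x, y):
    """Residuals vs x (linearity check) and residuals vs fitted values (constant-variance check)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    resid, fitted = fit.resid, fit.fittedvalues

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].scatter(x, resid)
    axes[0].axhline(0, linestyle="--")
    axes[0].set(xlabel="x", ylabel="residual", title="Residuals vs x")
    axes[1].scatter(fitted, resid)
    axes[1].axhline(0, linestyle="--")
    axes[1].set(xlabel="fitted value", ylabel="residual", title="Residuals vs fitted")
    plt.tight_layout()
    plt.show()
```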

29 Example 2 revisited: blood pressure vs body weight. [Residual plot.]

30 Checking the normality assumption: 1. Draw a histogram of the residuals and eyeball the result. 2. Make a normal probability plot (P–P plot) of the residuals, i.e. plot the expected cumulative probability of a normal distribution versus the observed cumulative probability at each value of the residual. If the assumption of normality is met, the points in this plot should form a straight diagonal line.
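A hand-rolled P–P plot is straightforward. This editorial sketch (assuming numpy, scipy, and matplotlib) plots the expected normal cumulative probabilities against the observed cumulative probabilities of the residuals, with a reference diagonal.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def pp_plot(residuals):
    """Normal P-P plot of residuals: points near the diagonal support the normality assumption."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    observed = (np.arange(1, n + 1) - 0.5) / n                  # empirical cumulative probabilities
    expected = stats.norm.cdf((r - r.mean()) / r.std(ddof=1))   # normal CDF of standardized residuals
    plt.scatter(observed, expected)
    plt.plot([0, 1], [0, 1], linestyle="--")                    # reference diagonal
    plt.xlabel("Observed cumulative probability")
    plt.ylabel("Expected cumulative probability")
    plt.show()
```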

31 Example 2 revisited: blood pressure vs body weight. [P–P plot.]

32 Outliers. An outlier is a data point that stands apart from the overall pattern seen in the scatter plot (i.e. an unusual or unexpected observation). It can be detected by looking at a scatter plot or residual plot. We should always search for an explanation for any outliers. Common sources of outliers include human and measurement errors during data collection and entry, sampling error, and chance. Some outliers can be corrected or removed, but some cannot. In general, outliers that cannot be corrected should not be removed. Outliers may influence the estimates of the model parameters and thus the study conclusions. In order to determine this influence, fit the line with and without the questionable points and see what happens.
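The refit-with-and-without check in the last sentence can be scripted. A sketch (editorial, assuming numpy and statsmodels; `suspect_idx` is a hypothetical list of row indices flagged as questionable):

```python
import numpy as np
import statsmodels.api as sm

def slope_with_and_without(x, y, suspect_idx):
    """Compare the fitted slope with and without the flagged observations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    full = sm.OLS(y, sm.add_constant(x)).fit()
    keep = np.ones(len(x), dtype=bool)
    keep[list(suspect_idx)] = False                   # drop the questionable points
    reduced = sm.OLS(y[keep], sm.add_constant(x[keep])).fit()
    return full.params[1], reduced.params[1]          # slopes with and without the outliers
```

A large change between the two slopes signals that the flagged points are influential.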

33 Simple linear regression. Assessing goodness of fit: the estimated regression line is the best one available (in the least-squares sense), yet it can still be a very poor fit to the observed data. [Figure: two scatter plots with fitted lines, one labelled "Good fit" and one "Bad fit".]

34 To assess the goodness of fit of a regression line (i.e. how well the line fits the data) we can: 1. Calculate the correlation coefficient between the predicted and observed values of y, R. A higher absolute value of R indicates a better fit (the predicted and observed values of y are closer to each other). 2. Calculate R² (R Square in SPSS): 0 ≤ R² ≤ 1; a higher value of R² indicates a better fit; R² = 1 indicates a perfect fit (i.e. ŷ_i = y_i for each i); R² = 0 indicates a very poor fit.

35 Alternatively, R² can be calculated as
$$ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{variation in } y \text{ explained by } x}{\text{total variation in } y}. $$
We interpret R² as the proportion of the total variability in y that can be explained by the explanatory variable x. An R² of 1 means that x explains all variability in y; an R² of 0 indicates that x does not explain any variability in y. R² is usually expressed as a percentage; for example, R² = 0.93 indicates that 93% of the total variation in y can be explained by x. In SPSS, R² can be found in the Model Summary table or calculated from the ANOVA table; both tables are produced when running linear regression.
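The ratio of explained to total variation can be computed directly. A short editorial sketch (assuming numpy and statsmodels) that should reproduce the R Square value reported by SPSS:

```python
import numpy as np
import statsmodels.api as sm

def r_squared(x, y):
    """R^2 = sum((yhat_i - ybar)^2) / sum((y_i - ybar)^2): explained share of total variation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    y_hat = fit.fittedvalues
    explained = np.sum((y_hat - y.mean()) ** 2)
    total = np.sum((y - y.mean()) ** 2)
    return explained / total        # equals fit.rsquared
```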

36 Example 2 revisited: blood pressure vs body weight. SPSS Model Summary: R = .950, R Square = .903, Adjusted R Square = .897, Std. Error of the Estimate = 1.74050. Predictors: (Constant), Weight.

37 Prediction: interpolation versus extrapolation. [Figure: fitted line over the range of the actual data, with several possible patterns of additional data diverging outside that range.] Extrapolation beyond the range of the data is risky!

38 Categorical explanatory variable. So far we have assumed that the predictor variable x is numerical. But what if we want to study an association between y and a categorical x, e.g. between blood pressure and gender or between skin cancer mortality and race/ethnicity? Categorical variables can be incorporated into a regression model through one or more indicator or dummy variables that take on the values 0 and 1. In general, to include a variable with p categories/levels, p − 1 dummy variables are required.

39 Categorical explanatory variable. Example: a variable x with 4 categories, e.g. blood group (A, B, AB, 0). Basic steps: 1. Create dummy variables for all categories: x_A = 1 if blood group is A, 0 otherwise; x_B = 1 if blood group is B, 0 otherwise; x_AB = 1 if blood group is AB, 0 otherwise; x_0 = 1 if blood group is 0, 0 otherwise.

40 Categorical explanatory variable. In a dataset, each subject gets a row with Subject ID, Blood group, and the four dummies x_A, x_B, x_AB, x_0 (for example, a subject with blood group A has x_A = 1 and x_B = x_AB = x_0 = 0). 2. Select one blood group as a reference category: a category that results in useful comparisons (e.g. exposed versus non-exposed, experimental versus standard treatment) or a category with a large number of subjects. 3. Include in the model all dummies except the one corresponding to the reference category.

41 Categorical explanatory variable. Taking blood group 0 as the reference category, the model becomes y = α + β_A x_A + β_B x_B + β_AB x_AB + ε, and its estimated counterpart is ŷ = a + b_A x_A + b_B x_B + b_AB x_AB. Estimation of the model parameters requires running multiple linear regression (next lecture), unless the explanatory variable has only two categories (e.g. gender). Given that y represents IQ score, the estimated coefficients are interpreted as follows: a is the mean IQ for subjects with blood group 0, i.e. the reference category; each b represents the mean difference in IQ between subjects with the blood group represented by the respective dummy variable and subjects with blood group 0 (the reference category).
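As a preview sketch of this dummy-variable model (editorial; the blood-group and IQ values below are invented, and the multiple regression itself is the topic of the next lecture), assuming pandas and statsmodels:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical blood group and IQ data, for illustration only
df = pd.DataFrame({
    "blood_group": ["A", "B", "AB", "0", "B", "A", "0", "AB", "0", "A"],
    "iq": [102.0, 98.0, 110.0, 100.0, 95.0, 105.0, 99.0, 108.0, 101.0, 103.0],
})

# One dummy per category, then drop the reference category "0"
dummies = pd.get_dummies(df["blood_group"]).drop(columns="0").astype(float)
X = sm.add_constant(dummies)
fit = sm.OLS(df["iq"], X).fit()

print(fit.params["const"])            # a: mean IQ in the reference group (blood group 0)
print(fit.params[["A", "B", "AB"]])   # b_A, b_B, b_AB: mean IQ differences vs blood group 0
```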

42 Categorical explanatory variable. Specifically: b_A is the difference between the mean IQ in subjects with blood group A and the mean IQ in subjects with blood group 0, i.e. b_A = ŷ(x_A = 1, x_B = 0, x_AB = 0) − a; b_B is the difference between the mean IQ in subjects with blood group B and the mean IQ in subjects with blood group 0, i.e. b_B = ŷ(x_A = 0, x_B = 1, x_AB = 0) − a; b_AB is the difference between the mean IQ in subjects with blood group AB and the mean IQ in subjects with blood group 0, i.e. b_AB = ŷ(x_A = 0, x_B = 0, x_AB = 1) − a. Note: a test for the significance of a categorical explanatory variable with p levels involves the hypothesis that the coefficients of all p − 1 dummy variables are zero. For that purpose, we need to use an overall F-test (next lecture) and not a t-test. The t-test can be used only when the variable is binary.

43 References. Gerald van Belle, Lloyd D. Fisher, Patrick J. Heagerty, Thomas Lumley. Biostatistics: A Methodology for the Health Sciences, 2nd edition. John Wiley & Sons, 2004. Wayne W. Daniel, Chad L. Cross. Biostatistics: A Foundation for Analysis in the Health Sciences, 10th edition. John Wiley & Sons, 2013.
