11. Analysis of Case-control Studies Logistic Regression

 To view this video please enable JavaScript, and consider upgrading to a web browser that supports HTML5 video
Save this PDF as:

Size: px
Start display at page:

Transcription

1 Research methods II Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health: Research methods. We have so far considered situations where the outcome variable is numeric continuous and the explanatory variables are a mix of numeric and categorical ones. But in clinical work we often wish to evaluate the effects of multiple explanatory variables on a binary outcome variable. For example, the effects of a number of risk factors on the development or otherwise of a disease. A patient may be cured of a disease or not; a risk factor may be present or not; a person may be immunised or not, and so on. The situation now is different from multiple regression analysis which requires that the outcome variable be measured on a continuous scale. For example, in the data file on Crime in the Unites States, the outcome variable Crime Rate was continuous, and among the explanatory variables one (North/South) was categorical, the rest being all numerical. In the data file on Scores, the outcome variable Score was continuous, and both the independent variables (Sex and Teaching Method) were categorical. When the outcome variable is binary, and one wishes to measure the effects of several independent variables on it the method of analysis to use is Logistic Regression. The binary outcome variable is coded 0 and1. The convention is to associate 1 with success (e.g. patient survived; risk factor is present; correct answer is given, and so on), and 0 with failure. For Logistic Regression the following conditions need to be satisfied: 1. An outcome variable with the two possible categorical outcomes, preferably in the form 0 and 1. (A dummy variable may have to be created for the purpose). 2. We need to estimate the probability P of the observed value of the outcome variable out of We need a way of relating the outcome variable to the explanatory variables. This is done by means of the logistic function, which is explained later. 4. We need a way of estimating the coefficients of the regression equation developed. 5. We need to test the goodness of fit of the regression model, as well as estimate the confidence intervals of the coefficients. Recall that in multiple linear regression the basic activity is to draw the least square line around which the values of Y (the outcome variable) are distributed. In contrast to that, in logistic regression we are trying to estimate the probability that a given individual will fall into one outcome group or the other. Thinking in terms of probability helps to interpret the coefficients in the logistic regression model in a meaningful manner, in the same way as we attach meaning to the coefficients in linear regression. As a measure of the probability we ask about the odds of an event occurring. If P is the probability of a given event, then (1 P) is the probability of the event not occurring. The odds of the event can be defined as: Odds = p i.e. Probability of event occurring Probability of event not occurring. 1 p

2 114 Analysis of Case-control Studies In other words, P / (1 P) is the ratio of the probability of success, also called the odds of success. (When given the odds ratio, we can calculate probability from P = odds ratio (1 + odds ratio). If there are several factors involved in deciding the eventual outcome we can calculate the odds ratio for each one separately. (This kind of crude odds ratio was described in relation to case - control studies in Chapter 6 of Mother and Child Health: Research methods). The joint effect of all the independent variables put together may be expressed mathematically as: α+β1x1 + β2x2 +. +βpxp Odds = P / (1 P) = e Where e is a mathematical expression (like π). The numerical value of e is equal to The independent contribution of each of several factors (i.e. X 1, X 2 X p ) in the above expression to the overall odds ratio can be determined by taking the logarithms of both sides of the equation. α + β1x1 + β2x2 +.βpxp Log e {P / (1 P)} = log e The term log {P/(1 P)} is referred to as logistic transformation of the probability P, and is written as logit (p), which is short for logistic unit. Transforming the counted proportion p (so many successes out of so many trials) to logit gets rid of the drawback of probability which varies from 0 to 1, whereas the logit can vary from to +. And so we have now natural logarithm of (P / 1 P) = logit P = α + β 1 X 1 + β 2 X β p X p The right hand side of the equation is the same as in multiple linear regression with which we are familiar. Recall that in linear regression the parameters of the model are estimated using the least squares approach. What happens is that the coefficients selected are such that the sums of squared distances between the observed values and the predicted values (i.e. the regression line) are smallest. In logistic regression the parameters are estimated such that the coefficients make our observed results most likely. The method is called the maximum likelihood method. What does it all mean? Suppose in the above equation the two possible values of the outcome variable are presence or absence of disease, (with presence of disease =1). Then α represents the overall disease risk, while β 1 is the fraction by which the risk is increased (or decreased) by every unit change in X 1 ; β 2 is the fraction by which the disease risk is altered by a unit change in X 2, and so on. The independent variables can be quantitative or qualitative e.g. by using dummy variables. However, one should bear in mind that the logit of a proportion p is the logarithm of the corresponding odds. If an X variable has a coefficient β, then a unit increase in X increases the log odds by an amount equal to β. This means that the odds themselves are increased by a factor of e β. So if β = 1.6, the odds are e 1.6 = 4.95.The logarithms of all the odds can be converted back to probability by the formula Ρ = e y /(1+e y ). With this introduction about the principles of logistic regression we next examine how it works in practice by considering the following example.

3 Research methods II 115 Several tests have been developed for diagnosing coronary artery disease. Cardiac catheterization is the most definitive of these tests, but also most invasive. A non-invasive method is imaging the blood flow to the myocardium using the radioisotope thallium while the patient is made to exercise. 100 people (55 men and 45 women) had both the exercise thallium test and cardiac catheterisation. Some of them were on propranolol. Change in heart rate and E.C.G. during exercise were monitored, and occurrence of pain during exercise was recorded. The objective was to determine which variables were good predictors of a positive thallium test. The data are given below. In the data set the following codes apply: Thallium (1= positive scan; 0 = negative scan) is the outcome variable. The explanatory variables are: Degree of stenosis of coronary artery as percentage block (0% = no block; 100% = complete block) Use of propranolol prior to test (0 = no; 1 = yes). Maximum heart rate during exercise ( 0 = heart rate did not rise to 85% of maximum predicted rate; 1= heart rate exceeded 85% of predicted rate). Ischaemia during exercise (1 = occurred; 0 = no). Sex (0=male; 1=female). Chest pain during exercise ( 0 = no pain; 1 = moderate pain; 2 = severe pain). E.C.G. changes during exercise ( 0 = no; 1 = yes). (Am.J.Cardiol.1985;55:54-57)

4 116 Analysis of Case-control Studies Thall Stn Propran HrtRate IscExr Sex PnExr ID Thall Stn Propran HrtRate IscExr Sex PnExr ID

5 Research methods II We next perform the logistic regression analysis with Thallium as the binary outcome variable. Recall that Thall = 1 means positive scan; 0 = negative scan. Binary Logistic Regression [ In MINITAB Stat Regression Binary Logistic Regression] Link Function: Logit 1).Response Information Variable Value Count Thall 1 34 (Event) 0 66 Total 100 2).Logistic Regression Table Odds 95% CI Predictor Coef StDev Z P Ratio Lower Upper Constant Stn Propran HrtRate IscExr Sex PnExr ).Log-Likelihood = Test that all slopes are zero: G = , DF = 6, P-Value = ).Goodness-of-Fit Tests Method Chi-Square DF P Pearson Deviance Hosmer-Lemeshow ).Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

6 118 Analysis of Case-control Studies Group Value Total 1 Obs Exp Obs Exp Total ).Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant % Somers' D 0.51 Discordant % Goodman-Kruskal Gamma 0.52 Ties % Kendall's Tau-a 0.23 Total % A number of diagnostic graphs are available for checking the appropriateness of the model. These are described in the section Logistic Regression Diagnostics. Interpreting the results The different parts of the output have been numbered for ease of following the explanation. 1). Response Information provides information about the response variable Thallium. There were 34 subjects with a positive scan (Value = 1), described here as event. 2). Logistic regression table shows the estimates of the β coefficients their standard deviations, z values (z = β coefficient standard deviation; also called Wald Statistic), and p values. Also the odds ratio and 95% confidence intervals for each of the coefficients are printed. In order to understand the logistic coefficients let us first consider what is being tested. We are looking at the odds of events. By definition, the odds of an event occurring is the ratio of the probability that it will occur to the probability that it will not occur. In other words Probability (event) Probability (no event).the logistic model is about the logarithms of the odds, which is referred to as logit. So the model may be written as: Log e {Probability (event) / Probability (no event) } = Log { e α + β1x1 + β2x2 βkxk } = α + β 1 X 1 + β 2 X 2 β k X k From the equation we can interpret that the logistic coefficients are measures of the change in log odds (i.e. logit) associated with a one unit change in the explanatory variable if it is a continuous numerical variable. Thus the coefficient of Stnprcnt (% stenosis) gives the estimated change in log { P(+ve thallium scan) P( ve thallium scan)} for every 1% stenosis as determined by coronary angiography whilst holding other independent variables constant. And e β value gives the actual odds ratio. For example, e = ; e 0.6 = 1.822, and so on. Significant P value at the bottom of the regression table tells us that there is a significant association between at least one explanatory variable and the outcome by testing whether all slopes are equal to zero. If the P value were not significant there would be no need to go

7 Research methods II 119 further. Next we consider the P values for each term in the model. These values tell us whether or not there is statistically significant association between a particular explanatory variable and the outcome. 3). Log Likelihood tells us about the probability of the observed results given the β coefficients. Since the likelihood is always less than 1 it is customary to use the logarithm of the likelihood and multiply it by 2. Hence the term 2LL. Its value provides a measure of how well the model fits the data. The log likelihood statistic is analogous to the error sum of squares in multiple linear regression. As such it is an indicator of how much unexplained information remains after fitting the model. A large value means a poorly fitting model. The larger the value of the log likelihood the more unexplained observations there are. Therefore, a good model means a small value for 2LL. If a model fits perfectly, the likelihood is 1, and 2 log 1=0. One advantage of multiplying the log likelihood by 2 is that -2LL has an approximately chi square distribution and so makes it possible to compare values against those that arise by chance alone. Next to consider is the statistic G. G is the difference in -2 LL between a model that has only the constant term and the fitted model. G tests the null hypothesis that all coefficients of the independent variables equal zero (i.e. no slope) versus the coefficients not being equal to zero. In our case the P value is 0.045, showing that at least one of the coefficients is not equal to zero. 4. Goodness-of-Fit. How well the model fits the observed data is assessed by a number of ways as follows: A variety of statistical tests (Pearson; Deviance; Hosmer-Lemenshow) are applied with their P values determined in order to assess how well the model describes the data. A low P value indicates that the predicted probabilities deviate from the observed probabilities in a way that the binary distribution does not predict. For example, the Hosmer- Lemenshaw test assesses the fit of the model by comparing the observed and expected frequencies. The estimated probabilities are grouped from lowest to highest, and then the Chi Square test is carried out to determine if the observed and expected frequencies are significantly different. When the Hosmer- Lemenshaw test is significant (as it is here) it means that the observed counts and those predicted by the model are not close, and the model does not describe the data well. When the Hosmer-Lemenshaw test is not significant it means that the observed and the predicted counts are close and the model describes the data well. In our case here the P values range from to This means that the null hypothesis of data not fitting the model cannot be rejected. In other words the goodness-of-fit is not that good. Pearson and Deviance are both types of residuals. The larger the P value the better is the fit of the model to the data. In this case it would be best to try alternative models and opt for the one that produces largest P values. No model has an exact fit. With enough data the goodness of fit test will always reject the model. However, what one is looking for is whether the model is good enough for analysis purposes. One can still make inferences from a model not fitting well but caution is needed. 5). Table of Observed and expected frequencies. How well the model fits the data is next assessed by comparing observed and expected frequencies by estimated probabilities from lowest to highest. Group 1 contains data with lowest estimated probability; group 8 contains data with highest estimated probability. There is disparity at the levels of Groups 7 and 8 for both Thall = 1 and Thall = 0.

8 120 Analysis of Case-control Studies 6). Measures of Association. The table displays the numbers and percent of concordant, discordant and tied pairs, as well as the common correlation tests. These values measure the association between observed responses and predicted probabilities. There were 34 individuals with +ve and 66 who had ve results on thallium test. So in all there are 34 x 66 = 2244 pairs with different values. A pair is concordant if a subject with +ve thallium test has a higher probability of being positive, and discordant if negative. 74% of the pairs proved to be concordant. This can be used as a measure of prediction. A series of tests under Summary measures consider concordance discordance as follows: Somer s D is obtained by the following: how many more concordant than discordant pairs exist Total number of pairs Goodman-Kruskal-Gamma is obtained by the following: how many more concordant than discordant pairs Total number of pairs number of ties Kendall s Tau is obtained as follows: how many more concordant than discordant pairs Total number of pairs including pairs with same response Large values for Somer s D, Goodman-Kruskal-Gamma and Kendall s Tau a indicate that the model has a good predictive ability. In our case there were 74% pairs concordant and 23% discordant giving 50% better chance for pairs to be concordant than discordant. At the end a choice of diagnostic curves can be looked at. These are discussed under Regression Diagnostics. Comment The main statistical work of logistic regression is finding the best fitting coefficients for α, β 1, β 2,. β p when a set of data is analysed. The fit of the model is evaluated, and the coefficients are appraised to assess the impact of the individual variables. The basic approach is the same as in multiple linear regression, but the mathematics is different. Basic format for multiple logistic regression is that the outcome variable is binary, coded as 1= event; 0 = no event. The independent variables can be continuous, or categorical. The odds of the estimated probability are P (1 P). The logistic model is formatted as the logit or natural logarithm of the odds. The logistic expression is Log e {p/1-p} = α + β 1 X 1 + β 2 X 2 + β p X p for a full regression on all p variables. As mentioned above the logistic method can process any type of explanatory variables. The coefficients are interpreted as odds ratios. The explanatory variables act as predictors or explainers of the target event.

9 Research methods II 121 Choosing the best model. Decision about the best model is made using the log likelihood ratios. In this respect the log likelihood ratio in logistic regression is analogous to F and R 2 in linear regression. For the same set of data, two models with the same number of variables can be compared by using each model s value of log likelihood ratio. Inclusion of more variables would almost always improve the fit as it does also in linear regression. Therefore, when two models have different numbers of variables the comparison is not correct. Suppose a researcher wishes to assess the importance of a particular risk factor. One way would be to perform two separate sets of analysis, one including the risk factor of interest (usually called the full model), and the other without it (hence called the reduced model). The computer output for each model would give the log-likelihood statistic. All that the researcher has to do is to simply work out the difference to obtain the likelihood statistic related to the risk factor of interest. The difference between log likelihood statistic for two models, one of which is a special case of the other, has an approximate chi-square distribution in large samples. The degree of freedom for this chi-square test in our example would be one, because one parameter has been added in the form of one additional risk factor. If we had added two risk factors to the model the degree of freedom would have been two, and so on. So the hypothesis that the additional risk factor makes no difference is tested by computing χ 2 = 2 (log likelihood full model log likelihood reduced model). Having obtained the value of χ 2 its significance level can be looked up by entering the table of chi-square at one degree of freedom. A second approach that does not use the likelihood method is doing the Wald test. Recall that the Wald statistic is obtained by dividing the value of the β coefficient of the factor of interest by its standard error. For large samples the likelihood ratio and Wald 2 (Wald squared) give approximately the same value. If the study is large enough it does not matter which statistic is used. In the case of small and moderate sized samples the two statistics may give different results. In such a case the likelihood ratio statistic is better than Wald. So the rule of thumb is, when in doubt use likelihood ratio. Wald statistic is useful when one is fitting only one model viz. the full model. Wald statistic also provides information whether the β coefficient of an explanatory variable (predictor) is significantly different from zero. It is analogous to the t test in multiple linear regression. As such it is making a significant contribution to the prediction of the outcome. In interpreting the beta coefficients it has been pointed out that each coefficient is the natural logarithm of the odds ratio for the variable it represents. The odds ratio itself would be e raised to the power equal to the numerical value of the coefficient concerned. If the variable is categorical and coded as 1/ 0 the interpretation of the β coefficient becomes particularly interesting. Now we can calculate the probability of event when the particular risk factor is present or absent by first calculating the value of Y when β = 1 and when β = 0, and then substituting the two values of Y in the formula P = e y / (1+e y ). Conditional and unconditional logistic regression Until recently most widely used computer packages offered unconditional procedures. Now several include facilities for conditional estimates also. In deciding between the two the researcher needs to consider the number of parameters in the model relative to the number of

10 122 Analysis of Case-control Studies subjects. Unconditional logistic regression is preferred if the number of parameters is small relative to the number of subjects. For example, consider a case-control study of 100 matched pairs making the number of subjects 200. Because of matching 99 dummy variables would be created to represent each pair. Add to this the intercept and the risk factor of interest. The number of parameters would be a minimum of 103. This situation requires conditional logistic regression. A simple rule of thumb is use conditional logistic regression if matching has been done, and unconditional if there has been no matching. A second rule of thumb is when in doubt use conditional because it always gives unbiased results. The unconditional method is said to overestimate the odds ratio if it is not appropriate. Regression Diagnostics In the case of multiple linear regression diagnostics were based on residuals. Recall that there are two types of residuals raw residuals, which are the differences between the observed and fitted values of the outcome variables, and standardised residuals. In the case of logistic regression Residual = 1 (estimated probability). The software calculates residuals for each case and can standardise them to help evaluate the fit of the model. Several schemes are available for standardisation, and different computer packages use different methods. The common types of residuals are: Standardised Pearson Residuals. These are similar to the standardised residuals of the multiple linear regression. Deviance residuals. Residual deviance in logistic regression corresponds to the residual sum of squares in multiple linear regression. Standardised deviance residuals are often used to detect extreme or outlying observations. The Delta-beta figure shows how many observations are exercising undue influence on the β coefficients. A fourth type of residual is called deletion residual, and is based on the change observed when a particular observation is excluded from the data set. Some authors call them likelihood residuals. In all cases, large residuals indicate observations poorly fitted by the model. We now consider the diagnostic graphs available in MINITAB. These are in two groups: Those involving event probability Those involving leverage Diagnostic plots involving event probability Delta chi square vs. probability

11 Research methods II 123 This plot helps to identify patterns that did not fit well. Delta Chi square measures change in the Pearson goodness-of-fit statistics due to deleting a particular factor. Points that are away from the main cluster correspond to low predicted probability. In the graph above 3 data points appear to be far away from the rest. Delta Deviance vs. probability This plot helps to identify observations that have been fit well by the model. Delta deviance measures change in the deviance goodness-of-fit statistic due to deleting a particular observation. Large delta deviance values correspond to low predicted probabilities.

12 124 Analysis of Case-control Studies Delta beta (standardized) versus probability. This plot identifies observations that have a strong influence on the estimated regression coefficients. A large delta beta (standardized) corresponds to a large leverage and/or a large residual. Delta beta versus probability. This plot identifies observations that have a strong influence on the estimated regression coefficients. Delta-beta is a measure of change in the regression coefficients due to deleting a particular observation.

13 Research methods II 125 Plots involving leverage This plot identifies observations that have not been fit well in the model. Delta deviance versus leverage Delta deviance measures change in the deviance goodness-of-fit statistic due to deleting a particular observation.

14 126 Analysis of Case-control Studies Delta Beta (standardized) versus leverage The plot identifies observations that have a strong influence on the estimated regression coefficients. Delta beta (standardized) measures change in regression coefficients due to deleting a particular observation. Delta Beta versus leverage. The plot identifies observations that have a strong influence on the regression coefficients. A large delta-beta often corresponds to an observation with a large leverage and /or large residual. Practical Issues Extremely high values for parameter estimates (i.e. α and β coefficients) and for standard errors indicate that the number of explanatory variables is too large relative to the number of subjects. The solution is to either increase the number of subjects or remove one or more explanatory variables from the analysis. Like multiple linear regression, logistic regression also is sensitive to collinearity among the explanatory variables. The telltale sign of collinearity is high values for standard errors in parameter estimates.

15 Research methods II 127 It is not uncommon for one or more cases to be quite poorly predicted. A case that in reality is in one category of outcome, may be shown to have a high probability of belonging in another category. If there are several such cases then obviously the model is not fitting the data well. Diagnostic plots are useful for identifying the outlying cases.

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

Categorical Data Analysis

Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods

2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

How Do We Test Multiple Regression Coefficients?

How Do We Test Multiple Regression Coefficients? Suppose you have constructed a multiple linear regression model and you have a specific hypothesis to test which involves more than one regression coefficient.

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

MORE ON LOGISTIC REGRESSION

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MORE ON LOGISTIC REGRESSION I. AGENDA: A. Logistic regression 1. Multiple independent variables 2. Example: The Bell Curve 3. Evaluation

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

Simple Linear Regression One Binary Categorical Independent Variable

Simple Linear Regression Does sex influence mean GCSE score? In order to answer the question posed above, we want to run a linear regression of sgcseptsnew against sgender, which is a binary categorical

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.) Logistic regression generalizes methods for 2-way tables Adds capability studying several predictors, but Limited to

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

Generalized Linear Models

Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

Logistic Regression. Introduction CHAPTER The Logistic Regression Model 14.2 Inference for Logistic Regression

Logistic Regression Introduction The simple and multiple linear regression methods we studied in Chapters 10 and 11 are used to model the relationship between a quantitative response variable and one or

13. Poisson Regression Analysis

136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

Multivariate Logistic Regression

1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

Yiming Peng, Department of Statistics. February 12, 2013

Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop

Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

, then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

Using Minitab for Regression Analysis: An extended example

Using Minitab for Regression Analysis: An extended example The following example uses data from another text on fertilizer application and crop yield, and is intended to show how Minitab can be used to

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

MULTIPLE REGRESSION WITH CATEGORICAL DATA

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

data on Down's syndrome

DATA a; INFILE 'downs.dat' ; INPUT AgeL AgeU BirthOrd Cases Births ; MidAge = (AgeL + AgeU)/2 ; Rate = 1000*Cases/Births; LogRate = Log( (Cases+0.5)/Births ); LogDenom = Log(Births); age_c = MidAge - 30;

Lecture 16: Logistic regression diagnostics, splines and interactions. Sandy Eckel 19 May 2007

Lecture 16: Logistic regression diagnostics, splines and interactions Sandy Eckel seckel@jhsph.edu 19 May 2007 1 Logistic Regression Diagnostics Graphs to check assumptions Recall: Graphing was used to

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing

Factors affecting online sales

Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

E205 Final: Version B

Name: Class: Date: E205 Final: Version B Multiple Choice Identify the choice that best completes the statement or answers the question. 1. The owner of a local nightclub has recently surveyed a random

Logistic regression diagnostics

Logistic regression diagnostics Biometry 755 Spring 2009 Logistic regression diagnostics p. 1/28 Assessing model fit A good model is one that fits the data well, in the sense that the values predicted

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

Inferential Statistics

Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are

Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

0.1 Multiple Regression Models

0.1 Multiple Regression Models We will introduce the multiple Regression model as a mean of relating one numerical response variable y to two or more independent (or predictor variables. We will see different

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY ABSTRACT Keywords: Logistic. INTRODUCTION This paper covers some gotchas in SAS R PROC LOGISTIC. A gotcha

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

Logistic Regression. Introduction. The Purpose Of Logistic Regression

Logistic Regression...1 Introduction...1 The Purpose Of Logistic Regression...1 Assumptions Of Logistic Regression...2 The Logistic Regression Equation...3 Interpreting Log Odds And The Odds Ratio...4

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Examining a Fitted Logistic Model

STAT 536 Lecture 16 1 Examining a Fitted Logistic Model Deviance Test for Lack of Fit The data below describes the male birth fraction male births/total births over the years 1931 to 1990. A simple logistic

7. Tests of association and Linear Regression

7. Tests of association and Linear Regression In this chapter we consider 1. Tests of Association for 2 qualitative variables. 2. Measures of the strength of linear association between 2 quantitative variables.

Simple linear regression

Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1. 1. Introduction p. 2. 2. Statistical Methods Used p. 5. 3. 10 and under Males p.

Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1 Table of Contents 1. Introduction p. 2 2. Statistical Methods Used p. 5 3. 10 and under Males p. 8 4. 11 and up Males p. 10 5. 10 and under

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

SUGI 29 Statistics and Data Analysis

Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,

Logistic regression: Model selection

Logistic regression: April 14 The WCGS data Measures of predictive power Today we will look at issues of model selection and measuring the predictive power of a model in logistic regression Our data set

DATA INTERPRETATION AND STATISTICS

PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

Binary Diagnostic Tests Two Independent Samples

Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

Chapter 11: Two Variable Regression Analysis

Department of Mathematics Izmir University of Economics Week 14-15 2014-2015 In this chapter, we will focus on linear models and extend our analysis to relationships between variables, the definitions

5. Ordinal regression: cumulative categories proportional odds. 6. Ordinal regression: comparison to single reference generalized logits

Lecture 23 1. Logistic regression with binary response 2. Proc Logistic and its surprises 3. quadratic model 4. Hosmer-Lemeshow test for lack of fit 5. Ordinal regression: cumulative categories proportional

Simple Linear Regression

Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression Statistical model for linear regression Estimating

Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

Correlation key concepts:

CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

AMS7: WEEK 8. CLASS 1. Correlation Monday May 18th, 2015

AMS7: WEEK 8. CLASS 1 Correlation Monday May 18th, 2015 Type of Data and objectives of the analysis Paired sample data (Bivariate data) Determine whether there is an association between two variables This

Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

Multiple Regression Analysis in Minitab 1

Multiple Regression Analysis in Minitab 1 Suppose we are interested in how the exercise and body mass index affect the blood pressure. A random sample of 10 males 50 years of age is selected and their

Regression in SPSS. Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology

Regression in SPSS Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology John P. Bentley Department of Pharmacy Administration University of

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

Basic Statistical and Modeling Procedures Using SAS

Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom

Introduction to Stata

Introduction to Stata September 23, 2014 Stata is one of a few statistical analysis programs that social scientists use. Stata is in the mid-range of how easy it is to use. Other options include SPSS,

Statistical Modeling Using SAS

Statistical Modeling Using SAS Xiangming Fang Department of Biostatistics East Carolina University SAS Code Workshop Series 2012 Xiangming Fang (Department of Biostatistics) Statistical Modeling Using

How to set the main menu of STATA to default factory settings standards

University of Pretoria Data analysis for evaluation studies Examples in STATA version 11 List of data sets b1.dta (To be created by students in class) fp1.xls (To be provided to students) fp1.txt (To be

Exercise 1.12 (Pg. 22-23)

Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

International Statistical Institute, 56th Session, 2007: Phil Everson

Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

4. Multiple Regression in Practice

30 Multiple Regression in Practice 4. Multiple Regression in Practice The preceding chapters have helped define the broad principles on which regression analysis is based. What features one should look

Linear Regression in SPSS

Linear Regression in SPSS Data: mangunkill.sav Goals: Examine relation between number of handguns registered (nhandgun) and number of man killed (mankill) checking Predict number of man killed using number

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

CRJ Doctoral Comprehensive Exam Statistics Friday August 23, :00pm 5:30pm

CRJ Doctoral Comprehensive Exam Statistics Friday August 23, 23 2:pm 5:3pm Instructions: (Answer all questions below) Question I: Data Collection and Bivariate Hypothesis Testing. Answer the following

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

Regression. Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Class: Date: Regression Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Given the least squares regression line y8 = 5 2x: a. the relationship between

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship

II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

where b is the slope of the line and a is the intercept i.e. where the line cuts the y axis.

Least Squares Introduction We have mentioned that one should not always conclude that because two variables are correlated that one variable is causing the other to behave a certain way. However, sometimes

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

Two Correlated Proportions (McNemar Test)

Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

10. Analysis of Longitudinal Studies Repeat-measures analysis

Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

Statistics 305: Introduction to Biostatistical Methods for Health Sciences

Statistics 305: Introduction to Biostatistical Methods for Health Sciences Modelling the Log Odds Logistic Regression (Chap 20) Instructor: Liangliang Wang Statistics and Actuarial Science, Simon Fraser

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

Module 4 - Multiple Logistic Regression

Module 4 - Multiple Logistic Regression Objectives Understand the principles and theory underlying logistic regression Understand proportions, probabilities, odds, odds ratios, logits and exponents Be

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

The scatterplot indicates a positive linear relationship between waist size and body fat percentage:

STAT E-150 Statistical Methods Multiple Regression Three percent of a man's body is essential fat, which is necessary for a healthy body. However, too much body fat can be dangerous. For men between the

The correlation coefficient

The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

e = random error, assumed to be normally distributed with mean 0 and standard deviation σ

1 Linear Regression 1.1 Simple Linear Regression Model The linear regression model is applied if we want to model a numeric response variable and its dependency on at least one numeric factor variable.

SELF-TEST: SIMPLE REGRESSION

ECO 22000 McRAE SELF-TEST: SIMPLE REGRESSION Note: Those questions indicated with an (N) are unlikely to appear in this form on an in-class examination, but you should be able to describe the procedures