11. Analysis of Case-control Studies Logistic Regression

Size: px
Start display at page:

Transcription

1 Research methods II Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health: Research methods. We have so far considered situations where the outcome variable is numeric continuous and the explanatory variables are a mix of numeric and categorical ones. But in clinical work we often wish to evaluate the effects of multiple explanatory variables on a binary outcome variable. For example, the effects of a number of risk factors on the development or otherwise of a disease. A patient may be cured of a disease or not; a risk factor may be present or not; a person may be immunised or not, and so on. The situation now is different from multiple regression analysis which requires that the outcome variable be measured on a continuous scale. For example, in the data file on Crime in the Unites States, the outcome variable Crime Rate was continuous, and among the explanatory variables one (North/South) was categorical, the rest being all numerical. In the data file on Scores, the outcome variable Score was continuous, and both the independent variables (Sex and Teaching Method) were categorical. When the outcome variable is binary, and one wishes to measure the effects of several independent variables on it the method of analysis to use is Logistic Regression. The binary outcome variable is coded 0 and1. The convention is to associate 1 with success (e.g. patient survived; risk factor is present; correct answer is given, and so on), and 0 with failure. For Logistic Regression the following conditions need to be satisfied: 1. An outcome variable with the two possible categorical outcomes, preferably in the form 0 and 1. (A dummy variable may have to be created for the purpose). 2. We need to estimate the probability P of the observed value of the outcome variable out of We need a way of relating the outcome variable to the explanatory variables. This is done by means of the logistic function, which is explained later. 4. We need a way of estimating the coefficients of the regression equation developed. 5. We need to test the goodness of fit of the regression model, as well as estimate the confidence intervals of the coefficients. Recall that in multiple linear regression the basic activity is to draw the least square line around which the values of Y (the outcome variable) are distributed. In contrast to that, in logistic regression we are trying to estimate the probability that a given individual will fall into one outcome group or the other. Thinking in terms of probability helps to interpret the coefficients in the logistic regression model in a meaningful manner, in the same way as we attach meaning to the coefficients in linear regression. As a measure of the probability we ask about the odds of an event occurring. If P is the probability of a given event, then (1 P) is the probability of the event not occurring. The odds of the event can be defined as: Odds = p i.e. Probability of event occurring Probability of event not occurring. 1 p

2 114 Analysis of Case-control Studies In other words, P / (1 P) is the ratio of the probability of success, also called the odds of success. (When given the odds ratio, we can calculate probability from P = odds ratio (1 + odds ratio). If there are several factors involved in deciding the eventual outcome we can calculate the odds ratio for each one separately. (This kind of crude odds ratio was described in relation to case - control studies in Chapter 6 of Mother and Child Health: Research methods). The joint effect of all the independent variables put together may be expressed mathematically as: α+β1x1 + β2x2 +. +βpxp Odds = P / (1 P) = e Where e is a mathematical expression (like π). The numerical value of e is equal to The independent contribution of each of several factors (i.e. X 1, X 2 X p ) in the above expression to the overall odds ratio can be determined by taking the logarithms of both sides of the equation. α + β1x1 + β2x2 +.βpxp Log e {P / (1 P)} = log e The term log {P/(1 P)} is referred to as logistic transformation of the probability P, and is written as logit (p), which is short for logistic unit. Transforming the counted proportion p (so many successes out of so many trials) to logit gets rid of the drawback of probability which varies from 0 to 1, whereas the logit can vary from to +. And so we have now natural logarithm of (P / 1 P) = logit P = α + β 1 X 1 + β 2 X β p X p The right hand side of the equation is the same as in multiple linear regression with which we are familiar. Recall that in linear regression the parameters of the model are estimated using the least squares approach. What happens is that the coefficients selected are such that the sums of squared distances between the observed values and the predicted values (i.e. the regression line) are smallest. In logistic regression the parameters are estimated such that the coefficients make our observed results most likely. The method is called the maximum likelihood method. What does it all mean? Suppose in the above equation the two possible values of the outcome variable are presence or absence of disease, (with presence of disease =1). Then α represents the overall disease risk, while β 1 is the fraction by which the risk is increased (or decreased) by every unit change in X 1 ; β 2 is the fraction by which the disease risk is altered by a unit change in X 2, and so on. The independent variables can be quantitative or qualitative e.g. by using dummy variables. However, one should bear in mind that the logit of a proportion p is the logarithm of the corresponding odds. If an X variable has a coefficient β, then a unit increase in X increases the log odds by an amount equal to β. This means that the odds themselves are increased by a factor of e β. So if β = 1.6, the odds are e 1.6 = 4.95.The logarithms of all the odds can be converted back to probability by the formula Ρ = e y /(1+e y ). With this introduction about the principles of logistic regression we next examine how it works in practice by considering the following example.

3 Research methods II 115 Several tests have been developed for diagnosing coronary artery disease. Cardiac catheterization is the most definitive of these tests, but also most invasive. A non-invasive method is imaging the blood flow to the myocardium using the radioisotope thallium while the patient is made to exercise. 100 people (55 men and 45 women) had both the exercise thallium test and cardiac catheterisation. Some of them were on propranolol. Change in heart rate and E.C.G. during exercise were monitored, and occurrence of pain during exercise was recorded. The objective was to determine which variables were good predictors of a positive thallium test. The data are given below. In the data set the following codes apply: Thallium (1= positive scan; 0 = negative scan) is the outcome variable. The explanatory variables are: Degree of stenosis of coronary artery as percentage block (0% = no block; 100% = complete block) Use of propranolol prior to test (0 = no; 1 = yes). Maximum heart rate during exercise ( 0 = heart rate did not rise to 85% of maximum predicted rate; 1= heart rate exceeded 85% of predicted rate). Ischaemia during exercise (1 = occurred; 0 = no). Sex (0=male; 1=female). Chest pain during exercise ( 0 = no pain; 1 = moderate pain; 2 = severe pain). E.C.G. changes during exercise ( 0 = no; 1 = yes). (Am.J.Cardiol.1985;55:54-57)

4 116 Analysis of Case-control Studies Thall Stn Propran HrtRate IscExr Sex PnExr ID Thall Stn Propran HrtRate IscExr Sex PnExr ID

5 Research methods II We next perform the logistic regression analysis with Thallium as the binary outcome variable. Recall that Thall = 1 means positive scan; 0 = negative scan. Binary Logistic Regression [ In MINITAB Stat Regression Binary Logistic Regression] Link Function: Logit 1).Response Information Variable Value Count Thall 1 34 (Event) 0 66 Total 100 2).Logistic Regression Table Odds 95% CI Predictor Coef StDev Z P Ratio Lower Upper Constant Stn Propran HrtRate IscExr Sex PnExr ).Log-Likelihood = Test that all slopes are zero: G = , DF = 6, P-Value = ).Goodness-of-Fit Tests Method Chi-Square DF P Pearson Deviance Hosmer-Lemeshow ).Table of Observed and Expected Frequencies: (See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

6 118 Analysis of Case-control Studies Group Value Total 1 Obs Exp Obs Exp Total ).Measures of Association: (Between the Response Variable and Predicted Probabilities) Pairs Number Percent Summary Measures Concordant % Somers' D 0.51 Discordant % Goodman-Kruskal Gamma 0.52 Ties % Kendall's Tau-a 0.23 Total % A number of diagnostic graphs are available for checking the appropriateness of the model. These are described in the section Logistic Regression Diagnostics. Interpreting the results The different parts of the output have been numbered for ease of following the explanation. 1). Response Information provides information about the response variable Thallium. There were 34 subjects with a positive scan (Value = 1), described here as event. 2). Logistic regression table shows the estimates of the β coefficients their standard deviations, z values (z = β coefficient standard deviation; also called Wald Statistic), and p values. Also the odds ratio and 95% confidence intervals for each of the coefficients are printed. In order to understand the logistic coefficients let us first consider what is being tested. We are looking at the odds of events. By definition, the odds of an event occurring is the ratio of the probability that it will occur to the probability that it will not occur. In other words Probability (event) Probability (no event).the logistic model is about the logarithms of the odds, which is referred to as logit. So the model may be written as: Log e {Probability (event) / Probability (no event) } = Log { e α + β1x1 + β2x2 βkxk } = α + β 1 X 1 + β 2 X 2 β k X k From the equation we can interpret that the logistic coefficients are measures of the change in log odds (i.e. logit) associated with a one unit change in the explanatory variable if it is a continuous numerical variable. Thus the coefficient of Stnprcnt (% stenosis) gives the estimated change in log { P(+ve thallium scan) P( ve thallium scan)} for every 1% stenosis as determined by coronary angiography whilst holding other independent variables constant. And e β value gives the actual odds ratio. For example, e = ; e 0.6 = 1.822, and so on. Significant P value at the bottom of the regression table tells us that there is a significant association between at least one explanatory variable and the outcome by testing whether all slopes are equal to zero. If the P value were not significant there would be no need to go

7 Research methods II 119 further. Next we consider the P values for each term in the model. These values tell us whether or not there is statistically significant association between a particular explanatory variable and the outcome. 3). Log Likelihood tells us about the probability of the observed results given the β coefficients. Since the likelihood is always less than 1 it is customary to use the logarithm of the likelihood and multiply it by 2. Hence the term 2LL. Its value provides a measure of how well the model fits the data. The log likelihood statistic is analogous to the error sum of squares in multiple linear regression. As such it is an indicator of how much unexplained information remains after fitting the model. A large value means a poorly fitting model. The larger the value of the log likelihood the more unexplained observations there are. Therefore, a good model means a small value for 2LL. If a model fits perfectly, the likelihood is 1, and 2 log 1=0. One advantage of multiplying the log likelihood by 2 is that -2LL has an approximately chi square distribution and so makes it possible to compare values against those that arise by chance alone. Next to consider is the statistic G. G is the difference in -2 LL between a model that has only the constant term and the fitted model. G tests the null hypothesis that all coefficients of the independent variables equal zero (i.e. no slope) versus the coefficients not being equal to zero. In our case the P value is 0.045, showing that at least one of the coefficients is not equal to zero. 4. Goodness-of-Fit. How well the model fits the observed data is assessed by a number of ways as follows: A variety of statistical tests (Pearson; Deviance; Hosmer-Lemenshow) are applied with their P values determined in order to assess how well the model describes the data. A low P value indicates that the predicted probabilities deviate from the observed probabilities in a way that the binary distribution does not predict. For example, the Hosmer- Lemenshaw test assesses the fit of the model by comparing the observed and expected frequencies. The estimated probabilities are grouped from lowest to highest, and then the Chi Square test is carried out to determine if the observed and expected frequencies are significantly different. When the Hosmer- Lemenshaw test is significant (as it is here) it means that the observed counts and those predicted by the model are not close, and the model does not describe the data well. When the Hosmer-Lemenshaw test is not significant it means that the observed and the predicted counts are close and the model describes the data well. In our case here the P values range from to This means that the null hypothesis of data not fitting the model cannot be rejected. In other words the goodness-of-fit is not that good. Pearson and Deviance are both types of residuals. The larger the P value the better is the fit of the model to the data. In this case it would be best to try alternative models and opt for the one that produces largest P values. No model has an exact fit. With enough data the goodness of fit test will always reject the model. However, what one is looking for is whether the model is good enough for analysis purposes. One can still make inferences from a model not fitting well but caution is needed. 5). Table of Observed and expected frequencies. How well the model fits the data is next assessed by comparing observed and expected frequencies by estimated probabilities from lowest to highest. Group 1 contains data with lowest estimated probability; group 8 contains data with highest estimated probability. There is disparity at the levels of Groups 7 and 8 for both Thall = 1 and Thall = 0.

8 120 Analysis of Case-control Studies 6). Measures of Association. The table displays the numbers and percent of concordant, discordant and tied pairs, as well as the common correlation tests. These values measure the association between observed responses and predicted probabilities. There were 34 individuals with +ve and 66 who had ve results on thallium test. So in all there are 34 x 66 = 2244 pairs with different values. A pair is concordant if a subject with +ve thallium test has a higher probability of being positive, and discordant if negative. 74% of the pairs proved to be concordant. This can be used as a measure of prediction. A series of tests under Summary measures consider concordance discordance as follows: Somer s D is obtained by the following: how many more concordant than discordant pairs exist Total number of pairs Goodman-Kruskal-Gamma is obtained by the following: how many more concordant than discordant pairs Total number of pairs number of ties Kendall s Tau is obtained as follows: how many more concordant than discordant pairs Total number of pairs including pairs with same response Large values for Somer s D, Goodman-Kruskal-Gamma and Kendall s Tau a indicate that the model has a good predictive ability. In our case there were 74% pairs concordant and 23% discordant giving 50% better chance for pairs to be concordant than discordant. At the end a choice of diagnostic curves can be looked at. These are discussed under Regression Diagnostics. Comment The main statistical work of logistic regression is finding the best fitting coefficients for α, β 1, β 2,. β p when a set of data is analysed. The fit of the model is evaluated, and the coefficients are appraised to assess the impact of the individual variables. The basic approach is the same as in multiple linear regression, but the mathematics is different. Basic format for multiple logistic regression is that the outcome variable is binary, coded as 1= event; 0 = no event. The independent variables can be continuous, or categorical. The odds of the estimated probability are P (1 P). The logistic model is formatted as the logit or natural logarithm of the odds. The logistic expression is Log e {p/1-p} = α + β 1 X 1 + β 2 X 2 + β p X p for a full regression on all p variables. As mentioned above the logistic method can process any type of explanatory variables. The coefficients are interpreted as odds ratios. The explanatory variables act as predictors or explainers of the target event.

9 Research methods II 121 Choosing the best model. Decision about the best model is made using the log likelihood ratios. In this respect the log likelihood ratio in logistic regression is analogous to F and R 2 in linear regression. For the same set of data, two models with the same number of variables can be compared by using each model s value of log likelihood ratio. Inclusion of more variables would almost always improve the fit as it does also in linear regression. Therefore, when two models have different numbers of variables the comparison is not correct. Suppose a researcher wishes to assess the importance of a particular risk factor. One way would be to perform two separate sets of analysis, one including the risk factor of interest (usually called the full model), and the other without it (hence called the reduced model). The computer output for each model would give the log-likelihood statistic. All that the researcher has to do is to simply work out the difference to obtain the likelihood statistic related to the risk factor of interest. The difference between log likelihood statistic for two models, one of which is a special case of the other, has an approximate chi-square distribution in large samples. The degree of freedom for this chi-square test in our example would be one, because one parameter has been added in the form of one additional risk factor. If we had added two risk factors to the model the degree of freedom would have been two, and so on. So the hypothesis that the additional risk factor makes no difference is tested by computing χ 2 = 2 (log likelihood full model log likelihood reduced model). Having obtained the value of χ 2 its significance level can be looked up by entering the table of chi-square at one degree of freedom. A second approach that does not use the likelihood method is doing the Wald test. Recall that the Wald statistic is obtained by dividing the value of the β coefficient of the factor of interest by its standard error. For large samples the likelihood ratio and Wald 2 (Wald squared) give approximately the same value. If the study is large enough it does not matter which statistic is used. In the case of small and moderate sized samples the two statistics may give different results. In such a case the likelihood ratio statistic is better than Wald. So the rule of thumb is, when in doubt use likelihood ratio. Wald statistic is useful when one is fitting only one model viz. the full model. Wald statistic also provides information whether the β coefficient of an explanatory variable (predictor) is significantly different from zero. It is analogous to the t test in multiple linear regression. As such it is making a significant contribution to the prediction of the outcome. In interpreting the beta coefficients it has been pointed out that each coefficient is the natural logarithm of the odds ratio for the variable it represents. The odds ratio itself would be e raised to the power equal to the numerical value of the coefficient concerned. If the variable is categorical and coded as 1/ 0 the interpretation of the β coefficient becomes particularly interesting. Now we can calculate the probability of event when the particular risk factor is present or absent by first calculating the value of Y when β = 1 and when β = 0, and then substituting the two values of Y in the formula P = e y / (1+e y ). Conditional and unconditional logistic regression Until recently most widely used computer packages offered unconditional procedures. Now several include facilities for conditional estimates also. In deciding between the two the researcher needs to consider the number of parameters in the model relative to the number of

10 122 Analysis of Case-control Studies subjects. Unconditional logistic regression is preferred if the number of parameters is small relative to the number of subjects. For example, consider a case-control study of 100 matched pairs making the number of subjects 200. Because of matching 99 dummy variables would be created to represent each pair. Add to this the intercept and the risk factor of interest. The number of parameters would be a minimum of 103. This situation requires conditional logistic regression. A simple rule of thumb is use conditional logistic regression if matching has been done, and unconditional if there has been no matching. A second rule of thumb is when in doubt use conditional because it always gives unbiased results. The unconditional method is said to overestimate the odds ratio if it is not appropriate. Regression Diagnostics In the case of multiple linear regression diagnostics were based on residuals. Recall that there are two types of residuals raw residuals, which are the differences between the observed and fitted values of the outcome variables, and standardised residuals. In the case of logistic regression Residual = 1 (estimated probability). The software calculates residuals for each case and can standardise them to help evaluate the fit of the model. Several schemes are available for standardisation, and different computer packages use different methods. The common types of residuals are: Standardised Pearson Residuals. These are similar to the standardised residuals of the multiple linear regression. Deviance residuals. Residual deviance in logistic regression corresponds to the residual sum of squares in multiple linear regression. Standardised deviance residuals are often used to detect extreme or outlying observations. The Delta-beta figure shows how many observations are exercising undue influence on the β coefficients. A fourth type of residual is called deletion residual, and is based on the change observed when a particular observation is excluded from the data set. Some authors call them likelihood residuals. In all cases, large residuals indicate observations poorly fitted by the model. We now consider the diagnostic graphs available in MINITAB. These are in two groups: Those involving event probability Those involving leverage Diagnostic plots involving event probability Delta chi square vs. probability

11 Research methods II 123 This plot helps to identify patterns that did not fit well. Delta Chi square measures change in the Pearson goodness-of-fit statistics due to deleting a particular factor. Points that are away from the main cluster correspond to low predicted probability. In the graph above 3 data points appear to be far away from the rest. Delta Deviance vs. probability This plot helps to identify observations that have been fit well by the model. Delta deviance measures change in the deviance goodness-of-fit statistic due to deleting a particular observation. Large delta deviance values correspond to low predicted probabilities.

12 124 Analysis of Case-control Studies Delta beta (standardized) versus probability. This plot identifies observations that have a strong influence on the estimated regression coefficients. A large delta beta (standardized) corresponds to a large leverage and/or a large residual. Delta beta versus probability. This plot identifies observations that have a strong influence on the estimated regression coefficients. Delta-beta is a measure of change in the regression coefficients due to deleting a particular observation.

13 Research methods II 125 Plots involving leverage This plot identifies observations that have not been fit well in the model. Delta deviance versus leverage Delta deviance measures change in the deviance goodness-of-fit statistic due to deleting a particular observation.

14 126 Analysis of Case-control Studies Delta Beta (standardized) versus leverage The plot identifies observations that have a strong influence on the estimated regression coefficients. Delta beta (standardized) measures change in regression coefficients due to deleting a particular observation. Delta Beta versus leverage. The plot identifies observations that have a strong influence on the regression coefficients. A large delta-beta often corresponds to an observation with a large leverage and /or large residual. Practical Issues Extremely high values for parameter estimates (i.e. α and β coefficients) and for standard errors indicate that the number of explanatory variables is too large relative to the number of subjects. The solution is to either increase the number of subjects or remove one or more explanatory variables from the analysis. Like multiple linear regression, logistic regression also is sensitive to collinearity among the explanatory variables. The telltale sign of collinearity is high values for standard errors in parameter estimates.

15 Research methods II 127 It is not uncommon for one or more cases to be quite poorly predicted. A case that in reality is in one category of outcome, may be shown to have a high probability of belonging in another category. If there are several such cases then obviously the model is not fitting the data well. Diagnostic plots are useful for identifying the outlying cases.

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

Categorical Data Analysis

Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods

MORE ON LOGISTIC REGRESSION

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MORE ON LOGISTIC REGRESSION I. AGENDA: A. Logistic regression 1. Multiple independent variables 2. Example: The Bell Curve 3. Evaluation

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

Ordinal Regression. Chapter

Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

Generalized Linear Models

Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.) Logistic regression generalizes methods for 2-way tables Adds capability studying several predictors, but Limited to

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

MULTIPLE REGRESSION WITH CATEGORICAL DATA

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

13. Poisson Regression Analysis

136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

Multivariate Logistic Regression

1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing

LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

Factors affecting online sales

Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY ABSTRACT Keywords: Logistic. INTRODUCTION This paper covers some gotchas in SAS R PROC LOGISTIC. A gotcha

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Simple linear regression

Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

Examining a Fitted Logistic Model

STAT 536 Lecture 16 1 Examining a Fitted Logistic Model Deviance Test for Lack of Fit The data below describes the male birth fraction male births/total births over the years 1931 to 1990. A simple logistic

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

How to set the main menu of STATA to default factory settings standards

University of Pretoria Data analysis for evaluation studies Examples in STATA version 11 List of data sets b1.dta (To be created by students in class) fp1.xls (To be provided to students) fp1.txt (To be

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

SUGI 29 Statistics and Data Analysis

Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

Correlation key concepts:

CORRELATION Correlation key concepts: Types of correlation Methods of studying correlation a) Scatter diagram b) Karl pearson s coefficient of correlation c) Spearman s Rank correlation coefficient d)

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

DATA INTERPRETATION AND STATISTICS

PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

4. Multiple Regression in Practice

30 Multiple Regression in Practice 4. Multiple Regression in Practice The preceding chapters have helped define the broad principles on which regression analysis is based. What features one should look

Basic Statistical and Modeling Procedures Using SAS

Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom

International Statistical Institute, 56th Session, 2007: Phil Everson

Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: peverso1@swarthmore.edu 1. Introduction

Stat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb 2 2015

Stat 412/512 CASE INFLUENCE STATISTICS Feb 2 2015 Charlotte Wickham stat512.cwick.co.nz Regression in your field See website. You may complete this assignment in pairs. Find a journal article in your field

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship

Exercise 1.12 (Pg. 22-23)

Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R.

ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. 1. Motivation. Likert items are used to measure respondents attitudes to a particular question or statement. One must recall

10. Analysis of Longitudinal Studies Repeat-measures analysis

Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.

LOGIT AND PROBIT ANALYSIS

LOGIT AND PROBIT ANALYSIS A.K. Vasisht I.A.S.R.I., Library Avenue, New Delhi 110 012 amitvasisht@iasri.res.in In dummy regression variable models, it is assumed implicitly that the dependent variable Y

Module 4 - Multiple Logistic Regression

Module 4 - Multiple Logistic Regression Objectives Understand the principles and theory underlying logistic regression Understand proportions, probabilities, odds, odds ratios, logits and exponents Be

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

Statistics 305: Introduction to Biostatistical Methods for Health Sciences

Statistics 305: Introduction to Biostatistical Methods for Health Sciences Modelling the Log Odds Logistic Regression (Chap 20) Instructor: Liangliang Wang Statistics and Actuarial Science, Simon Fraser

II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

Two Correlated Proportions (McNemar Test)

Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

Binary Logistic Regression

Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

The correlation coefficient

The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

Binary Diagnostic Tests Two Independent Samples

Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

Introduction to Quantitative Methods

Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

How To Run Statistical Tests in Excel

How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared

Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

Descriptive Statistics

Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

Simple Regression Theory II 2010 Samuel L. Baker

SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

Simple Linear Regression

STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze

Logistic Regression (1/24/13)

STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

NEW YORK UNIVERSITY ROBERT F. WAGNER GRADUATE SCHOOL OF PUBLIC SERVICE Course Syllabus Spring 2016 Statistical Methods for Public, Nonprofit, and Health Management Section Format Day Begin End Building

Statistics 2014 Scoring Guidelines

AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home

Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

General Method: Difference of Means. 3. Calculate df: either Welch-Satterthwaite formula or simpler df = min(n 1, n 2 ) 1.

General Method: Difference of Means 1. Calculate x 1, x 2, SE 1, SE 2. 2. Combined SE = SE1 2 + SE2 2. ASSUMES INDEPENDENT SAMPLES. 3. Calculate df: either Welch-Satterthwaite formula or simpler df = min(n

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

Yew May Martin Maureen Maclachlan Tom Karmel Higher Education Division, Department of Education, Training and Youth Affairs.

How is Australia s Higher Education Performing? An analysis of completion rates of a cohort of Australian Post Graduate Research Students in the 1990s. Yew May Martin Maureen Maclachlan Tom Karmel Higher

Chi-square test Fisher s Exact test

Lesson 1 Chi-square test Fisher s Exact test McNemar s Test Lesson 1 Overview Lesson 11 covered two inference methods for categorical data from groups Confidence Intervals for the difference of two proportions

Directions for using SPSS

Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...

Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

Logistic (RLOGIST) Example #1

Logistic (RLOGIST) Example #1 SUDAAN Statements and Results Illustrated EFFECTS RFORMAT, RLABEL REFLEVEL EXP option on MODEL statement Hosmer-Lemeshow Test Input Data Set(s): BRFWGT.SAS7bdat Example Using

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

Dummy Coding for Dummies Kathryn Martin, Maternal, Child and Adolescent Health Program, California Department of Public Health ABSTRACT There are a number of ways to incorporate categorical variables into

Regression 3: Logistic Regression

Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing