Logistic regression modeling the probability of success


as a function of SAT levels off for high values of SAT.

(3) The situation is similar for low SAT (say around 600). At that point the probability of being admitted is so low that one point lower on the SAT subtracts little from the probability of being admitted. In other words, the probability curve as a function of SAT levels off for low values of SAT.

These three patterns suggest that the probability curve is likely to have an S shape, as in the following picture.

[Figure: S-shaped curve of probability of success versus X]

Hopefully this informal argument is convincing, but this characteristic S shape also comes up clearly empirically. Consider the following example, which we will look at more carefully a bit later. The data are from a survey of people taken in Whickham, a mixed urban and rural district near Newcastle upon Tyne, United Kingdom. Twenty years later a follow-up study was conducted. Among the information obtained originally was the age of the respondent and whether they were a smoker or not, and it was recorded whether the person was still alive twenty years later.

© 2014, Jeffrey S. Simonoff

Here is a scatter plot of the observed survival proportion by the midpoint of each age interval, separated by smoking status. Note that the proportions do not follow a straight line, but rather an S shape.

[Figure: survival proportion versus age, plotted separately for nonsmokers and smokers]

S-shaped curves can be fit using the logit (or logistic) function:

$$\eta \equiv \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x, \qquad (1)$$

where $p$ is the probability of a success. The term $p/(1-p)$ is the odds of success:

    p          1/2    2/3    3/4    4/5
    p/(1-p)     1      2      3      4

Equation (1) is equivalent to

$$p = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}} \qquad (2)$$
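As a quick numerical check of equations (1) and (2), here is a minimal Python sketch. The function names and the coefficient values `beta0` and `beta1` are illustrative assumptions, not taken from any fitted model in this handout.

```python
import math

def odds(p):
    """Odds of success, p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Equation (1): the log odds."""
    return math.log(odds(p))

def inverse_logit(eta):
    """Equation (2): map a linear predictor back to a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Odds for the probabilities tabulated above.
odds_table = {label: odds(p) for label, p in
              [("1/2", 1 / 2), ("2/3", 2 / 3), ("3/4", 3 / 4), ("4/5", 4 / 5)]}

# Hypothetical coefficients (assumed for illustration only).
beta0, beta1 = -4.0, 0.5
xs = [0, 4, 8, 12, 16]
curve = [inverse_logit(beta0 + beta1 * x) for x in xs]

for x, p in zip(xs, curve):
    print(f"x = {x:2d}  p = {p:.3f}")
```

The printed probabilities rise slowly, then quickly, then slowly again as x increases, which is the S shape; applying `logit` to any of them recovers the linear predictor `beta0 + beta1 * x`.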

Equation (2) gives the desired S-shaped curve. Different choices of $\beta$ give curves that are steeper (larger $\beta_1$) or flatter (smaller $\beta_1$), or shaped as an inverted (backwards) S (negative $\beta_1$). Equations (1) and (2) generalize in the obvious way to allow for multiple predictors.

The logistic model says that the log odds follows a linear model. Another way of saying this is that $\hat\beta_1$ has the following interpretation: a one unit increase in $x$ is estimated to be associated with multiplying the odds of success by $e^{\hat\beta_1}$, holding all else in the model fixed. The estimated odds of success when all predictors equal zero is obtained from the constant term as $e^{\hat\beta_0}$. The fact that these values are based on exponentiation reinforces that it is a good idea to use units that are meaningful for your data, as otherwise you can get estimated odds ratios that are either extremely large, almost exactly 1, or virtually zero. Note that since the logit is based on natural logs, there is a clear advantage to using the natural logarithm (base $e$) rather than the common logarithm (base 10) if you need to log a predictor: if this is done, the coefficient $\hat\beta_1$ represents an elasticity of the odds. So, for example, a coefficient $\hat\beta_1 = 2$ means that a 1% increase in $x$ is associated with a (roughly) 2% increase in the odds of success.

The logistic regression model is an example of a generalized linear model. The model is that $y_i \sim \text{Binomial}(1, p_i)$, with $p_i$ satisfying the logistic model (2). The parameters are estimated using maximum likelihood (OLS, WLS, and GLS are versions of maximum likelihood in Gaussian error regression models), which implies that the resultant estimated probabilities of success are the maximum likelihood estimates of the conditional probabilities of success given the observed values of the predictors.

The logit function is not the only function that yields S-shaped curves, and it would seem that there is no reason to prefer the logit to other possible choices. For example, another function that generates S-shaped curves is the cumulative distribution function for the normal distribution; analysis using that function is called probit analysis. Logistic regression (that is, use of the logit function) has several advantages over other methods, however. The logit function is what is called the canonical link function, which means that parameter estimates under logistic regression are fully efficient, and tests on those parameters are better behaved for small samples. Even more importantly, the logit function is the only choice with a very important

between these two sampling approaches is that in a retrospective study the sampling rate is different for successes and for failures; that is, you deliberately oversample one group and undersample the other so as to get a reasonable number of observations in each group (it is not required that the two groups have the same number of observations, only that one group is oversampled while the other is undersampled).

To simplify things, let's assume that initial debt is recorded only as Low or High. This implies that the data take the form of a 2 × 2 contingency table (whatever the sampling scheme):

                      Bankrupt
    Debt         Yes          No
    Low          $n_{LY}$     $n_{LN}$     $n_L$
    High         $n_{HY}$     $n_{HN}$     $n_H$
                 $n_Y$        $n_N$        $n$

Even though the data have the same form, whatever the sampling scheme, the way these data are generated is very different. Here are two tables that give the expected counts in the four data cells, depending on the sampling scheme. The $\pi$ values are conditional probabilities (so, for example, $\pi_{Y|L}$ is the probability of a business going bankrupt given that it has low initial debt):

    PROSPECTIVE SAMPLE
                      Bankrupt
    Debt         Yes                 No
    Low          $n_L \pi_{Y|L}$     $n_L \pi_{N|L}$
    High         $n_H \pi_{Y|H}$     $n_H \pi_{N|H}$

    RETROSPECTIVE SAMPLE
                      Bankrupt
    Debt         Yes                 No
    Low          $n_Y \pi_{L|Y}$     $n_N \pi_{L|N}$
    High         $n_Y \pi_{H|Y}$     $n_N \pi_{H|N}$

Note that there is a fundamental difference between the probabilities that can be estimated using the two sampling schemes. For example, what is the probability that a business goes bankrupt given that it has low initial debt? As noted above, this is $\pi_{Y|L}$. It is easily estimated from a prospective sample ($\hat\pi_{Y|L} = n_{LY}/n_L$), as can be seen from the left table, but it is impossible to estimate it from a retrospective sample. On the other hand, given that a business went bankrupt, what is the probability that it had low initial debt? That is $\pi_{L|Y}$, which is estimable from a retrospective sample ($\hat\pi_{L|Y} = n_{LY}/n_Y$), but not from a prospective sample.
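The two estimates just described use the same cell count but different margins. The following Python sketch makes that concrete; the counts are made up for illustration and do not come from any data set in this handout.

```python
# Hypothetical 2 x 2 table of counts, keyed by (debt level, bankrupt?).
counts = {
    ("Low", "Yes"): 10, ("Low", "No"): 90,
    ("High", "Yes"): 30, ("High", "No"): 70,
}

n_LY = counts[("Low", "Yes")]
n_L = counts[("Low", "Yes")] + counts[("Low", "No")]    # row total for Low debt
n_Y = counts[("Low", "Yes")] + counts[("High", "Yes")]  # column total for Bankrupt = Yes

# Meaningful only if the table came from a prospective sample:
# estimated P(bankrupt | low debt) = n_LY / n_L.
pi_Y_given_L = n_LY / n_L

# Meaningful only if the table came from a retrospective sample:
# estimated P(low debt | bankrupt) = n_LY / n_Y.
pi_L_given_Y = n_LY / n_Y

print(f"n_LY / n_L = {pi_Y_given_L:.3f}")
print(f"n_LY / n_Y = {pi_L_given_Y:.3f}")
```

Both ratios can always be computed from the table; which of them estimates a real population probability depends on which margin the sampling design fixed.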

This is where the advantage of logistic regression comes in. While these different probabilities are not estimable for one sampling scheme or the other, the comparison between odds for the groupings that are relevant for that sampling scheme is identical in either scheme, and is always estimable. Consider first a prospective study. The odds are $p/(1-p)$, where here $p$ is the probability of a business going bankrupt. For the low debt businesses, $p = \pi_{Y|L}$, so the odds are

$$\frac{\pi_{Y|L}}{1-\pi_{Y|L}} = \frac{\pi_{Y|L}}{\pi_{N|L}}.$$

For the high debt businesses, the odds are

$$\frac{\pi_{Y|H}}{1-\pi_{Y|H}} = \frac{\pi_{Y|H}}{\pi_{N|H}}.$$

Recall that the logistic regression model says that the logarithm of odds follows a linear model; that is, for this model, the ratio of the odds is a specified value. This ratio of odds (called the cross-product ratio) equals

$$\frac{\pi_{Y|L}/\pi_{N|L}}{\pi_{Y|H}/\pi_{N|H}}.$$

By the definition of conditional probabilities, this simplifies as follows:

$$\frac{\pi_{Y|L}/\pi_{N|L}}{\pi_{Y|H}/\pi_{N|H}} = \frac{\pi_{Y|L}\,\pi_{N|H}}{\pi_{Y|H}\,\pi_{N|L}} = \frac{(\pi_{LY}/\pi_L)(\pi_{HN}/\pi_H)}{(\pi_{HY}/\pi_H)(\pi_{LN}/\pi_L)} = \frac{\pi_{LY}\,\pi_{HN}}{\pi_{HY}\,\pi_{LN}}.$$

This last value is easily estimated using the sample cross-product ratio,

$$\frac{n_{LY}\,n_{HN}}{n_{HY}\,n_{LN}}.$$

Now consider a retrospective study. The odds are $p/(1-p)$, where here $p$ is the probability of a business having low debt (remember that the proportion that are bankrupt is set by the design). For the bankrupt businesses, $p = \pi_{L|Y}$, so the odds are

$$\frac{\pi_{L|Y}}{1-\pi_{L|Y}} = \frac{\pi_{L|Y}}{\pi_{H|Y}}.$$

For the non-bankrupt businesses, the odds are

$$\frac{\pi_{L|N}}{1-\pi_{L|N}} = \frac{\pi_{L|N}}{\pi_{H|N}}.$$

The cross-product ratio equals

$$\frac{\pi_{L|Y}/\pi_{H|Y}}{\pi_{L|N}/\pi_{H|N}}.$$

By the definition of conditional probabilities, this simplifies as follows:

$$\frac{\pi_{L|Y}/\pi_{H|Y}}{\pi_{L|N}/\pi_{H|N}} = \frac{\pi_{L|Y}\,\pi_{H|N}}{\pi_{L|N}\,\pi_{H|Y}} = \frac{(\pi_{LY}/\pi_Y)(\pi_{HN}/\pi_N)}{(\pi_{LN}/\pi_N)(\pi_{HY}/\pi_Y)} = \frac{\pi_{LY}\,\pi_{HN}}{\pi_{LN}\,\pi_{HY}}.$$

This, of course, is identical to the cross-product ratio from the prospective study. That is, while the type of individual probability that can be estimated from data depends on the sampling scheme, the cross-product ratio is unambiguous whichever sampling scheme is appropriate. The logit link function is the only choice where effects are determined by the cross-product ratio, so it is the only link function with this property.

It might seem that a retrospective sampling scheme is a decidedly inferior choice, since we cannot answer the question of most interest; that is, what is the probability that a business goes bankrupt given its initial debt level? There is, however, a way that we can construct an answer to this question, if we have additional information. Let $\pi_Y$ and $\pi_N$ be the true probabilities that a randomly chosen business goes bankrupt or does not go bankrupt, respectively. These numbers are called prior probabilities. Consider the probability of bankruptcy given a low initial debt level, $\pi_{Y|L}$. By the definition of conditional probabilities,

$$\pi_{Y|L} = \frac{\pi_{LY}}{\pi_L} = \frac{\pi_{L|Y}\,\pi_Y}{\pi_L} = \frac{\pi_{L|Y}\,\pi_Y}{\pi_{LY}+\pi_{LN}} = \frac{\pi_{L|Y}\,\pi_Y}{\pi_{L|Y}\,\pi_Y + \pi_{L|N}\,\pi_N}.$$
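The two identities above (equal cross-product ratios under either sampling scheme, and the recovery of $\pi_{Y|L}$ from retrospective quantities plus prior probabilities) can be verified numerically. This is a minimal Python sketch; the joint probabilities are hypothetical values chosen only so that they sum to 1, and the helper names are assumptions made for illustration.

```python
# Hypothetical joint probabilities pi_{debt, bankrupt}; they sum to 1.
joint = {("L", "Y"): 0.05, ("L", "N"): 0.55,
         ("H", "Y"): 0.10, ("H", "N"): 0.30}

def p_debt(d):
    """Marginal probability of debt level d (pi_L or pi_H)."""
    return joint[(d, "Y")] + joint[(d, "N")]

def p_bankrupt(b):
    """Marginal (prior) probability of bankruptcy status b (pi_Y or pi_N)."""
    return joint[("L", b)] + joint[("H", b)]

def bankrupt_given_debt(b, d):
    """Prospective conditional probability pi_{b|d}."""
    return joint[(d, b)] / p_debt(d)

def debt_given_bankrupt(d, b):
    """Retrospective conditional probability pi_{d|b}."""
    return joint[(d, b)] / p_bankrupt(b)

# Cross-product ratio from prospective conditionals.
cpr_prospective = ((bankrupt_given_debt("Y", "L") / bankrupt_given_debt("N", "L"))
                   / (bankrupt_given_debt("Y", "H") / bankrupt_given_debt("N", "H")))

# Cross-product ratio from retrospective conditionals.
cpr_retrospective = ((debt_given_bankrupt("L", "Y") / debt_given_bankrupt("H", "Y"))
                     / (debt_given_bankrupt("L", "N") / debt_given_bankrupt("H", "N")))

# Recover pi_{Y|L} from retrospective conditionals and the prior probabilities.
numerator = debt_given_bankrupt("L", "Y") * p_bankrupt("Y")
pi_Y_given_L = numerator / (numerator + debt_given_bankrupt("L", "N") * p_bankrupt("N"))

print(cpr_prospective, cpr_retrospective, pi_Y_given_L)
```

Both ratios equal $\pi_{LY}\pi_{HN}/(\pi_{HY}\pi_{LN})$, and the recovered `pi_Y_given_L` matches the direct prospective value `bankrupt_given_debt("Y", "L")`.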

The probabilities $\pi_{L|Y}$ and $\pi_{L|N}$ are estimable in a retrospective study, so if we have a set of prior probabilities $(\pi_Y, \pi_N)$ we can estimate the probability of interest. This point comes up again in other analyses involving group classification, such as discriminant analysis.

Since odds ratios are uniquely defined for both prospective and retrospective studies, the slope coefficients $\{\beta_1, \beta_2, \ldots\}$ are also uniquely defined. The constant term $\beta_0$, however, is driven by the observed proportions of successes and failures, so it is affected by the construction of the study. Thus, as noted above, in a retrospective study, the results of a logistic regression fit cannot be used to estimate the probability of success. If prior probabilities of success ($\pi_Y$) and failure ($\pi_N$) are available, however, the constant term can be adjusted, so that probabilities can be estimated. The adjusted constant term has the form

$$\tilde\beta_0 = \hat\beta_0 + \ln\left(\frac{\pi_Y\, n_N}{\pi_N\, n_Y}\right).$$

Thus, probabilities of success can be estimated for a retrospective study using equation (2) (allowing for multiple predictors, of course), with $\beta_0$ estimated using $\tilde\beta_0$ and $\beta_1$ estimated using $\hat\beta_1$. Note that this adjustment is only done when the sample is a retrospective one, not if (for example) observations in a prospective study have been omitted because they are outliers. If you're not sure about what $(\pi_Y, \pi_N)$ should be, you can try a range of values to assess the sensitivity of your estimated probabilities based on the adjusted intercept to the specific choice you make.

The logistic model (1) or (2) is a sensible one for probabilities, but is not necessarily appropriate for a particular data set. This is not the same thing as saying that the predicting variables are not good predictors for the probability of success. Consider the following two plots. The variable X is a potential predictor for the probability of success, while the vertical axis gives the observed proportion of successes in samples taken at those values of X. So, for example, X could be the dosage of a particular drug, and the target variable is the proportion of people in a trial that were cured when given that dosage.

[Figure: two plots of the proportion of successes versus X]

In the plot on the left, X is very useful for predicting success, but the logistic regression model does not fit the data, since the probability of success is not a monotone function of X. In the plot on the right, X is not a useful predictor for success (the probability of success appears to be unrelated to X), but the logistic regression model fits the data, as a very flat S-shaped curve goes through the observed proportions of success reasonably well.

Minitab provides three goodness-of-fit tests that test the hypotheses

    $H_0$: the logistic model fits the data
    versus
    $H_a$: the logistic model does not fit the data

that can be used to determine if the correct model is being fit. What if it isn't? Then the model (1) should be enriched to allow for the observed pattern. For the nonmonotonic pattern in the left plot above, obvious things to try are to fit a quadratic model by including $X^2$ as a predictor, or a model where X is considered nominal with three levels corresponding to low, medium, and high. A variation on this theme is to find and use additional predictors. Note that one reason why the logistic model might not fit is that the relationship between logits and predictors might be multiplicative, rather than additive; so, for example, if a predictor is long right-tailed, you might find that using a logged predictor is more effective than using an unlogged one. Unusual observations (outliers, leverage points) are also still an issue for these models, and Minitab provides diagnostics for them.

Another source of lack of fit of a logistic regression model is that the binomial distribution isn't appropriate. This can happen, for example, if there is correlation among the responses, or if there is heterogeneity in the success probabilities that hasn't been modeled. Both of these violations can lead to overdispersion, where the variability of the probability estimates is larger than would be implied by a binomial random variable. In that case you would need to use methods that correct for the overdispersion, either implicitly (such as through what is known as quasi-likelihood) or explicitly (fitting a regression model using a response distribution that incorporates overdispersion, such as the beta-binomial distribution). All of these issues are covered in my book Analyzing Categorical Data, which was published by Springer-Verlag in 2003.

Logistic regression also can be used for prediction of group membership, or classification. A fitted model (with estimated $\hat\beta$) can be applied to new data to give an estimate of the probability of success for that observation using equation (2). If a simple success/failure prediction is desired, this can be done by calling an estimated $\hat p$ less than .5 a predicted failure, and an estimated $\hat p$ greater than .5 a success. Other cutoff values might make sense also. For example, the prior probability of success, often called the base rate (if one is available), or the observed sample proportion of successes is often more effective if those values are far from .5 (that is, if only 30% of the observations are successes, say, it's reasonable to set the cutoff in the classification matrix as .3). Another way to classify is to choose a cutoff that takes into account the costs of misclassification into success and failure.

If this is done to the same data used to fit the logistic regression model, a classification table results. Here is a hypothetical example of such a table:

                         Predicted group
                        Success   Failure
    Actual   Success       8         3        11
    group    Failure       2         7         9
                          10        10        20

In this example 75.0% of the observations (15 out of 20) were correctly classified to the appropriate group. Is that a lot? More formally, is that more than we would expect just by random chance?

The answer to this question is not straightforward, because we have used the same data to both build the model (getting $\hat\beta$) and evaluate its ability to do predictions (that is, we've used the data twice). For this reason, the observed proportion correctly classified can be expected to be biased upwards compared with what would be seen if the model were applied to completely new data.

What can we do about this? The best thing to do is what is implied by the last line in the previous paragraph: get some new data, and apply the fitted model to it to see how well it does (that is, validate the model on new data). What if no new data are available? Two diagnostics that have been suggested can help here. One approach we could take would be to say the following: if we simply predicted every observation to have come from the larger group, what proportion would we have gotten right? For these data, that is

$$C_{max} = \max\left(\frac{11}{20}, \frac{9}{20}\right) = 55.0\%.$$

We certainly hope to do better than this naive rule, so $C_{max}$ provides a useful lower bound for what we would expect the observed proportion correctly classified to be. The observed 75.0% correctly classified is quite a bit larger than $C_{max}$, supporting the usefulness of the logistic regression.

A more sophisticated argument is as follows. If the logistic regression had no power to make predictions, the actual group that an observation falls into would be independent of the predicted group. That is, for example,

$$P(\text{Actual group a success and Predicted group a success}) = P(\text{Actual group a success}) \times P(\text{Predicted group a success}).$$

The right side of this equation can be estimated using the marginal probabilities from the classification table, yielding

$$\hat P(\text{Actual group a success and Predicted group a success}) = \left(\frac{11}{20}\right)\left(\frac{10}{20}\right) = .275.$$

That is, we would expect to get 27.5% of the observations correctly classified as successes just by random chance. A similar calculation for the failures gives

$$\hat P(\text{Actual group a failure and Predicted group a failure}) = \left(\frac{9}{20}\right)\left(\frac{10}{20}\right) = .225.$$

Thus, assuming that the logistic regression had no predictive power for the actual group (success or failure), we would expect to correctly classify 27.5% + 22.5% = 50% of the observations. This still doesn't take into account that we've used the data twice, so this number is typically inflated by 25% before being reported; that is, $C_{pro} = 1.25 \times 50\% = 62.5\%$. The observed 75% is still considerably higher than $C_{pro}$, supporting the usefulness of the logistic regression.

At the beginning of this handout I mentioned the modeling of bankruptcy as an example of a potential logistic regression problem. There is a long history of at least 35 years of trying to use publicly available data to predict whether or not a company will go bankrupt, and it is clear at this point from many empirical studies that it is possible to do so, at least up to four years ahead. That is, classification tables show better classificatory performance than would have been expected by chance. Interestingly, however, this does not translate into the ability to make money, as apparently stock price reaction precedes the forecast, providing yet another argument in support of the efficient market hypothesis.

It is possible that a logistic regression data set constitutes a time series, and in that case autocorrelation can be a problem (recall that one of the assumptions here is that the Bernoulli observations be independent of each other). Time series models for binary response data are beyond the scope of this course, but there are at least a few things you can do to assess whether autocorrelation might be a problem. If the data form a time series, it is sensible to plot the standardized Pearson residuals in time order to see if there are any apparent patterns. You can also construct an ACF plot of the residuals, and construct a runs test for them. You should recognize, however, that the residuals are only roughly normally distributed (if that), and often have highly asymmetric distributions, so these tests can only be viewed as suggestive of potential problems. If you think that you might have autocorrelation, one potential corrective procedure to consider is to use a lagged version of the response as a predictor.

Logistic regression can be generalized to more than two groups. Polytomous (or nominal) logistic regression, the generalization of a binomial response with two groups to a multinomial response with multiple groups, proceeds by choosing (perhaps arbitrarily) one group as the control or standard. Say there are $K$ groups, and group $K$ is the one chosen as the standard. The logistic model states that the probability of falling into group $j$ given the set of predictor values $x$ is

$$P(\text{group } j \mid x) = \frac{\exp(\beta_{0j} + \beta_{1j} x_1 + \cdots + \beta_{pj} x_p)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{0l} + \beta_{1l} x_1 + \cdots + \beta_{pl} x_p)}.$$

Equivalently,

$$\ln\left(\frac{p_j}{p_K}\right) = \beta_{0j} + \beta_{1j} x_1 + \cdots + \beta_{pj} x_p.$$

If group $K$ can be thought of as a control group, for example, this is a natural model where the log odds compared to the control is linear in the predictors (with different coefficients for each of the groups). So, for example, in a clinical trial context for a new drug, the control group might be No side effects, while the other groups are Headache, Back pain, and Dizziness. If group $K$ is chosen arbitrarily, it is harder to interpret the coefficients of the model.

Each of these models could be fit separately using ordinary (binary) logistic regression, but the resultant coefficient estimates would not be exactly the same. More importantly, polytomous logistic regression allows the construction of a test of the statistical significance of the variables in accounting for the simultaneous difference in probabilities of all of the response categories, not just individual groups versus the control group. Further, the estimated probability vector (of the probabilities of falling into each group) for an observation is guaranteed to be the same no matter which group is chosen as the reference group when fitting a polytomous logistic regression model. If it was desired to classify observations, they would then be assigned to the group with largest estimated probability.

There is one additional assumption required for the validity of the multinomial logistic regression model: the assumption of independence of irrelevant alternatives, or IIA. This assumption states that the odds of one group relative to another is unaffected by the presence or absence of other groups. A test for this condition exists (the so-called Hausman-McFadden test), although it is not provided in Minitab. An alternative approach that does not assume IIA is the multinomial probit model (also not provided in Minitab), although research indicates that the multinomial logit model is a more effective model even if the IIA assumption is violated.

A classification matrix can be formed when there are more than two groups the same way that it is when there are two groups. The two benchmarks $C_{max}$ and $C_{pro}$ are formed in an analogous way. $C_{max}$ is the proportion of the sample that comes from the largest of the groups in the sample. To calculate $C_{pro}$, calculate the proportions of the sample that come from each group and the proportions of the sample that are predicted to be from each group; then, multiply the actual group proportion times the predicted group proportion for each group and sum over all of the groups; finally, multiply by 1.25.

In many situations there is a natural ordering to the groups, which we would like to take into account. For example, respondents to a survey might be asked their opinion on a political issue, with responses coded on a 5-point Likert scale of the form

    Strongly disagree   Disagree   Neutral   Agree   Strongly agree

Is there some way to analyze this type of data set? There are two affirmative answers to that question: a common, but not very correct approach, and a not so common, but much more correct approach. The common approach is to just use ordinary least squares regression, with the group membership identifier as the target variable. This often doesn't work so badly, but there are several problems with its use in general:

(1) An integral target variable clearly is inconsistent with continuous (and hence Gaussian) errors, violating one of the assumptions when constructing hypothesis tests and confidence intervals.

(2) Predictions from an ordinary regression model are of course nonintegral, resulting in difficulties in interpretation: what does a prediction to group 2.763 mean, exactly?

(3) The least squares regression model doesn't answer the question that is of most interest here: what is the estimated probability that an observation falls into a particular group given its predictor values?

(4) The regression model implicitly assumes that the groups are equally distant from each other, in the sense that the lowest group is as far from the second lowest as the highest is from second highest. It could easily be argued that this is not sensible for a given data set.

The solution to these problems is a generalization of the logistic regression model to ordinal data. The model is still based on logits, but now they are cumulative logits. The most common cumulative logit model (and the one fitted by Minitab) is called the proportional odds model. For simplicity, consider a model with one predictor $x$ and target group identifier $Y$ that is a measure of agreement rated from lowest to highest. Let

$$L_j(x) = \text{logit}[F_j(x)] = \ln\left[\frac{F_j(x)}{1-F_j(x)}\right], \qquad j = 1, \ldots, J-1,$$

where $F_j(x) = P(Y \le j \mid x)$ is the cumulative probability for response category $j$, and $J$ is the total number of response categories (so $Y$ takes on values $\{1, \ldots, J\}$, corresponding to strong disagreement to strong agreement). The proportional odds model says that

$$L_j(x) = \alpha_j + \beta x, \qquad j = 1, \ldots, J-1.$$

Note that this means that a positive $\beta$ is associated with increasing odds of being less than a given value $j$; that is, a positive coefficient implies increasing probability of being in lower numbered categories with increasing $x$.

This model can be motivated using an underlying continuous response variable $Y^*$ representing, in some sense, agreement. Say $Y^*$ follows the logistic distribution with location $\beta_0 + \beta x$ for given value of $x$. Suppose $-\infty = \alpha_0 < \alpha_1 < \cdots < \alpha_J = \infty$ are such that the observed response $Y$ satisfies

$$Y = j \quad \text{if} \quad \alpha_{j-1} < Y^* \le \alpha_j.$$

That is, we observe a response in category $j$ when the underlying $Y^*$ falls in the $j$th interval of values. Then, if the underlying response is linearly related to $x$, the proportional odds model is the representation for the underlying probabilities. Thus, the model allows for the same idea of linearity that a least squares regression model does, but allows for unequally spaced groups through the estimates of $\alpha$. That is, for example, for a 5-point Likert scale target the model allows for the possibility that people who strongly disagree with a statement are more like people who disagree (responses 1 and 2), than are people who disagree and people who are neutral (responses 2 and 3). Here is a graphical representation of what is going on (this is actually Figure 10.3 from my Analyzing Categorical Data book):

[Figure: underlying response $Y^*$ versus X, with cutoffs $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ marked on the vertical axis and predictor values $x_1$, $x_2$, $x_3$ on the horizontal axis]

You might wonder about the logistic distribution mentioned before. This distribution is similar to the normal distribution, being symmetric but slightly longer tailed. A version of the proportional odds model that assumes a normal distribution for the underlying variable $Y^*$ would be based on probits, rather than logits.
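To see how the proportional odds model turns cumulative logits into category probabilities, here is a minimal Python sketch. It assumes the form $L_j(x) = \alpha_j + \beta x$ given above; the cutoffs `alphas`, the slope `beta`, and the function names are hypothetical values chosen for illustration, not estimates from any data.

```python
import math

def inverse_logit(z):
    """Logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(alphas, beta, x):
    """Return [P(Y=1|x), ..., P(Y=J|x)] under L_j(x) = alpha_j + beta * x.

    alphas holds the J-1 increasing cutoffs alpha_1 < ... < alpha_{J-1}.
    """
    # Cumulative probabilities F_1(x), ..., F_{J-1}(x), plus F_J(x) = 1.
    cumulative = [inverse_logit(a + beta * x) for a in alphas] + [1.0]
    probabilities = []
    previous = 0.0
    for f in cumulative:
        probabilities.append(f - previous)  # P(Y = j) = F_j - F_{j-1}
        previous = f
    return probabilities

# Hypothetical cutoffs for a 5-point scale and a hypothetical positive slope.
alphas = [-2.0, -0.5, 0.5, 2.0]
beta = 0.8

probs_at_x0 = category_probabilities(alphas, beta, x=0.0)
probs_at_x1 = category_probabilities(alphas, beta, x=1.0)

print([round(p, 3) for p in probs_at_x0])
print([round(p, 3) for p in probs_at_x1])
```

With a positive `beta`, moving from x = 0 to x = 1 shifts probability toward the lower numbered categories, in line with the note above about the sign of $\beta$. The unequal gaps between the cutoffs are what allow the categories to be unequally spaced.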


LOGISTIC REGRESSION ANALYSIS

LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

Simple Linear Regression

STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4

4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report

Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 13 and Accuracy Under the Compound Multinomial Model Won-Chan Lee November 2005 Revised April 2007 Revised April 2008

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

Interpretation of Somers' D under four simple models

Interpretation of Somers' D under four simple models Roger B. Newson 03 September, 04 Introduction Somers' D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these

Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

Logit and Probit. Brad Jones 1. April 21, 2009. University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science

Logit and Probit Brad 1 1 Department of Political Science University of California, Davis April 21, 2009 Logit, redux Logit resolves the functional form problem (in terms of the response function in the

Logit Models for Binary Data

Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

Recall this chart that showed how most of our course would be organized:

Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables

Predicting Successful Completion of the Nursing Program: An Analysis of Prerequisites and Demographic Variables Introduction In the summer of 2002, a research study commissioned by the Center for Student

Simple Random Sampling

Source: Frerichs, R.R. Rapid Surveys (unpublished), 2008. NOT FOR COMMERCIAL DISTRIBUTION 3 Simple Random Sampling 3.1 INTRODUCTION Everyone mentions simple random sampling, but few use this method for

Nonlinear Regression Functions. SW Ch 8 1/54/

Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

How To Check For Differences In The One Way Anova

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

Simple linear regression

Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I

Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA - Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting

Biostatistics: Types of Data Analysis

Biostatistics: Types of Data Analysis Theresa A Scott, MS Vanderbilt University Department of Biostatistics theresa.scott@vanderbilt.edu http://biostat.mc.vanderbilt.edu/theresascott Theresa A Scott, MS

Regression 3: Logistic Regression

Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing

2. What is the general linear model to be used to model linear trend? (Write out the model) = + + + or

Simple and Multiple Regression Analysis Example: Explore the relationships among Month, Adv.$ and Sales $: 1. Prepare a scatter plot of these data. The scatter plots for Adv.$ versus Sales, and Month versus

Algebra 1 Course Title

Algebra 1 Course Title Course- wide 1. What patterns and methods are being used? Course- wide 1. Students will be adept at solving and graphing linear and quadratic equations 2. Students will be adept

2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

ANALYSIS OF TREND CHAPTER 5

ANALYSIS OF TREND CHAPTER 5 ERSH 8310 Lecture 7 September 13, 2007 Today's Class Analysis of trends Using contrasts to do something a bit more practical. Linear trends. Quadratic trends. Trends in SPSS.

Multiple Choice Models II

Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

Nominal and ordinal logistic regression

Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal

MGT 267 PROJECT Forecasting the United States Retail Sales of the Pharmacies and Drug Stores Done by: Shunwei Wang & Mohammad Zainal Dec. 2002 The retail sale (Million) ABSTRACT The present study aims

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

DATA INTERPRETATION AND STATISTICS

PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books An easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

Univariate Regression

Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

SUGI 29 Statistics and Data Analysis

Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,

Penalized regression: Introduction

Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

Local classification and local likelihoods

Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

MULTIPLE REGRESSION WITH CATEGORICAL DATA

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

Forecasting in supply chains

1 Forecasting in supply chains Role of demand forecasting Effective transportation system or supply chain design is predicated on the availability of accurate inputs to the modeling process. One of the

Choice under Uncertainty

Choice under Uncertainty Part 1: Expected Utility Function, Attitudes towards Risk, Demand for Insurance Slide 1 Choice under Uncertainty We'll analyze the underlying assumptions of expected utility theory

Introduction to Statistics and Quantitative Research Methods

Introduction to Statistics and Quantitative Research Methods Purpose of Presentation To aid in the understanding of basic statistics, including terminology, common terms, and common statistical methods.

In this chapter, you will learn improvement curve concepts and their application to cost and price analysis.

7.0 - Chapter Introduction In this chapter, you will learn improvement curve concepts and their application to cost and price analysis. Basic Improvement Curve Concept. You may have learned about improvement

It is important to bear in mind that one of the first three subscripts is redundant since k = i -j +3.

IDENTIFICATION AND ESTIMATION OF AGE, PERIOD AND COHORT EFFECTS IN THE ANALYSIS OF DISCRETE ARCHIVAL DATA Stephen E. Fienberg, University of Minnesota William M. Mason, University of Michigan 1. INTRODUCTION

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistics I for QBIC Text Book: Biostatistics, 10th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

EC247 FINANCIAL INSTRUMENTS AND CAPITAL MARKETS TERM PAPER

EC247 FINANCIAL INSTRUMENTS AND CAPITAL MARKETS TERM PAPER NAME: IOANNA KOULLOUROU REG. NUMBER: 1004216 1 Term Paper Title: Explain what is meant by the term structure of interest rates. Critically evaluate

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

Inequality, Mobility and Income Distribution Comparisons

Fiscal Studies (1997) vol. 18, no. 3, pp. 93 30 Inequality, Mobility and Income Distribution Comparisons JOHN CREEDY * Abstract This paper examines the relationship between the cross-sectional and lifetime

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE MAT 119 STATISTICS AND ELEMENTARY ALGEBRA 5 Lecture Hours, 2 Lab Hours, 3 Credits Pre-

Using Excel for Statistical Analysis

Using Excel for Statistical Analysis You don't have to have a fancy pants statistics package to do many statistical functions. Excel can perform several statistical tests and analyses. First, make sure

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT The purpose of this paper is to investigate several SAS procedures that are used in

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright © 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR'S PERMISSION* This is a

Introduction to time series analysis

Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples

Motor and Household Insurance: Pricing to Maximise Profit in a Competitive Market

Motor and Household Insurance: Pricing to Maximise Profit in a Competitive Market by Tom Wright, Partner, English Wright & Brockman 1. Introduction This paper describes one way in which statistical modelling

INTERPRETATION OF REGRESSION OUTPUT: DIAGNOSTICS, GRAPHS AND THE BOTTOM LINE. Wesley Johnson and Mitchell Watnik University of California USA

INTERPRETATION OF REGRESSION OUTPUT: DIAGNOSTICS, GRAPHS AND THE BOTTOM LINE Wesley Johnson and Mitchell Watnik University of California USA A standard approach in presenting the results of a statistical

COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.

277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies