Advanced data analysis


1 Advanced data analysis M.Gerolimetto Dip. di Statistica Università Ca Foscari Venezia, margherita

2 PART 2: LOGISTIC REGRESSION

3 Dichotomous dependent variables

When the dependent variable of a multiple regression model is qualitative and must be expressed by a dummy variable, special estimation problems arise. An example is explaining whether or not an individual will buy a car: the dependent variable Y takes value 1 if he/she buys the car and 0 if he/she does not. The explanatory variables can be qualitative variables represented by dummies (for example the characteristics of the car) as well as quantitative variables (for example the price of the car and the income of the person). The predicted values of the dependent variable will fall mainly in the interval [0,1], so they can be interpreted as the probability that the individual will buy the car, given the characteristics of the car, the income, etc. Approximating this relationship by a straight line produces a very poor fit.

4 In order to avoid probabilities outside the [0,1] range, a NONLINEAR MODEL is used instead of a linear one: it works by squeezing the probabilities into the (0,1) range. So, in place of a linear function (as in linear multiple regression) a nonlinear function is used. The most common nonlinear functions, both very appropriate for this framework, are the LOGISTIC FUNCTION and the CUMULATIVE NORMAL FUNCTION.

Linear function -> usual multiple regression model: Y = Xβ + u
Logistic function -> logistic regression: P(Y = 1) = f(Xβ)
Cumulative normal function -> probit regression: P(Y = 1) = f(Xβ)

The model is called logistic or probit regression depending on the choice of f(), called the link function.
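The squeezing effect can be checked numerically. The following is a minimal sketch, not part of the original slides: the coefficients `b0` and `b1` are hypothetical, and the cumulative normal function is written with `math.erf` from the Python standard library. It compares the linear index with the logistic and probit transforms of the same index.

```python
import math

def logistic(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Standard normal cumulative distribution function (probit case)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical coefficients for a single regressor x (illustration only)
b0, b1 = 0.5, 0.4

for x in (-5.0, 0.0, 5.0):
    index = b0 + b1 * x  # the linear part X*beta
    # The raw index can leave [0, 1]; both transforms stay strictly inside (0, 1)
    print(x, round(index, 3), round(logistic(index), 3), round(normal_cdf(index), 3))
```

With these values the linear index is negative at x = -5 and larger than 1 at x = 5, so it cannot be read as a probability, while both transformed values can.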

5 A hidden error term

In logistic (probit) regression the error term does not appear explicitly in the model: it is implicit. The formulation comes from the random utility model, which states that an individual buys a certain product (Y = 1) if the utility connected to it exceeds a threshold, or is the highest among the products in the same consideration set. This utility is expressed by a linear function that includes an error term. So the expression P(Y = 1) is equivalent to P(U > 0), where U = Xβ + u. Hence P(Y = 1) = P(Xβ + u > 0), and this probability is modeled by f(Xβ). The error term is absorbed in the act of writing P(Y = 1), so it does not appear explicitly in the logit or probit model.

NB: The error term is always in the model, even though it is not evident!
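A small simulation can make the hidden error term visible. This is a sketch under a stated assumption that the slides do not spell out: the error u is drawn from a standard logistic distribution, which is the choice that leads to the logit model. The function name `simulate_purchase_share` and the value of the index are hypothetical.

```python
import math
import random

def logistic(z):
    """Logistic function, i.e. the CDF of a standard logistic variable."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate_purchase_share(xb, n_draws=200_000, seed=0):
    """Share of draws with utility U = xb + u > 0, u ~ standard logistic (assumed)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # Inverse-CDF draw of a standard logistic variable; clamp to avoid log(0)
        p = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        u = math.log(p / (1.0 - p))
        if xb + u > 0.0:
            hits += 1
    return hits / n_draws

xb = 0.8  # hypothetical value of the linear index X*beta
print(simulate_purchase_share(xb), logistic(xb))
```

The simulated share of purchases and the value of the logistic function at the same index should agree up to simulation noise, which is the sense in which P(Xβ + u > 0) is modeled by f(Xβ).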

6 Violating the assumptions of multiple regression

The binary nature of the dependent variable leads to the violation of some assumptions of the multiple regression model:

1. The Y's are not normal; they follow a binomial distribution instead. The inference of the multiple regression model, based on the assumption of normality, loses its validity here.
2. The variance of a binomial variable is not constant, so the hypothesis of homoskedasticity is also violated.
3. The conditional expectation E(Y | X) is fitted by a function that is nonlinear.

Logistic regression (and the probit model as well) deals with these problems by introducing a different approach to estimating the coefficients, interpreting them, and assessing the goodness of fit of the model (no more ordinary least squares estimation, no more R²).

7 Inference

When estimating the coefficients in the logistic regression context, standard OLS inference cannot be used. One possibility is to estimate the coefficients by maximum likelihood (ML). The ML procedure requires the maximization of the likelihood function: this amounts to choosing the coefficient values that agree most closely with the empirical evidence given by the sample. Note that maximum likelihood estimates are often obtained with an efficient numerical algorithm called Iteratively Reweighted Least Squares.
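To illustrate how the reweighting iteration works, here is a minimal sketch for a model with an intercept and one regressor, written in plain Python. The data and the function name `fit_logit_irls` are hypothetical, and the update is written in its equivalent Newton form, β_new = β + (X'WX)^(-1) X'(y - p) with W = diag(p(1 - p)); statistical packages implement the same idea with more safeguards.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit_irls(x, y, n_iter=25, tol=1e-10):
    """ML estimates (b0, b1) for P(Y=1) = logistic(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [logistic(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]  # weights, recomputed at every step
        # Score vector X'(y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X'WX (2 x 2, symmetric)
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Step = (X'WX)^(-1) X'(y - p), using the explicit 2 x 2 inverse
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical sample: the zeros and ones overlap, so the ML estimate is finite
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logit_irls(x, y)
print(b0, b1)
```

At convergence the score vector is zero, which is the first-order condition for the maximum of the likelihood.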

8 The logistic function

Suppose we have a sample of n units with observations on a dichotomous variable Y and a quantitative variable X, and we want to explain Y with X using a logistic regression. The link function is then the logistic function (usually indicated by g), which for a generic variable z is:

$$g(z) = \frac{e^{z}}{1 + e^{z}}$$

In the logistic regression context, this function is exploited as follows:

$$P(Y = 1) = \frac{e^{X\beta}}{1 + e^{X\beta}}, \qquad P(Y = 0) = 1 - P(Y = 1) = \frac{1}{1 + e^{X\beta}}$$

Once the estimate of the parameter β is obtained, call it b, the fitted values are:

$$\hat{P}(Y = 1) = \frac{e^{Xb}}{1 + e^{Xb}}, \qquad \hat{P}(Y = 0) = \frac{1}{1 + e^{Xb}}$$

9 The coefficients

Because of the nonlinear functional form, the marginal effect of an explanatory variable on the dependent variable is not given by the coefficient of that variable, but by a suitable function of the coefficient. One way to interpret the coefficients is to consider the expression:

$$\log\frac{P(Y = 1)}{P(Y = 0)} = \log\frac{e^{X\beta}/(1 + e^{X\beta})}{1/(1 + e^{X\beta})} = X\beta$$

The ratio of the probability that the event occurs to the probability that it does not occur is the ODDS (often loosely called the odds ratio), and its logarithm is called the LOGIT transform. Hence each coefficient β has to be read as the change in the logit following a unit variation in the explanatory variable.

10 Odds ratio

Another way to interpret the coefficients is to consider only the ratio of the probability that the event occurs to the probability that it does not occur (the odds). This ratio is:

$$\frac{P(Y = 1)}{P(Y = 0)} = \frac{e^{X\beta}/(1 + e^{X\beta})}{1/(1 + e^{X\beta})} = e^{X\beta}$$

In this case the exponentiated coefficients reflect changes in the odds following a unit variation in the explanatory variable: e^β is the factor by which the odds are multiplied, i.e. the odds ratio between X + 1 and X.

Coefficients (β) are particularly useful to determine the sign of the relationship: a positive coefficient indicates that a unit increase in X is associated with an increase in the predicted probability, and vice versa.

Exponentiated coefficients (e^β) are particularly useful to express the magnitude of the relationship: the impact is multiplicative, which means that they tell how many times bigger (or smaller) P(Y=1) becomes relative to P(Y=0).
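The multiplicative reading of e^β can be verified numerically. In the sketch below the coefficients `b0` and `b1` are hypothetical. It shows that a unit increase in x multiplies the odds P(Y=1)/P(Y=0) by the same factor e^b1 at every starting point, while the change in the probability itself depends on where x starts.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of the event: P(Y=1) / P(Y=0)."""
    return p / (1.0 - p)

b0, b1 = -1.0, 0.7  # hypothetical coefficients

for x in (0.0, 2.0):
    p_before = logistic(b0 + b1 * x)
    p_after = logistic(b0 + b1 * (x + 1.0))
    ratio = odds(p_after) / odds(p_before)  # equals exp(b1) for every x
    print(x, round(p_after - p_before, 4), round(ratio, 4), round(math.exp(b1), 4))
```

The logit is recovered in the same way: `math.log(odds(logistic(z)))` returns z up to rounding error.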

11 Significance of the coefficients

In logistic regression the null hypothesis that a coefficient is not significant is tested similarly to what is done in linear regression. If we consider the β coefficients (i.e. the logit is regarded as the dependent variable), a coefficient equal to zero means that the variable has no impact. To confirm this, think of the e^β coefficients (i.e. the odds are regarded as the dependent variable). When β is zero, e^β = 1, so a unit variation in the explanatory variable multiplies the odds P(Y = 1)/P(Y = 0) by 1: the odds, and therefore P(Y = 1), do not change with that variable, and there is no way the variable is useful in making predictions!

Hypothesis testing is done with the Wald test (instead of the t test), because the estimation method is not standard OLS.
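As an illustration of the Wald test for a single coefficient, the sketch below computes the statistic z = b / se(b) and its two-sided p-value under the asymptotic standard normal distribution. The estimate and standard error are hypothetical numbers, and the helper name `wald_test` is not from the slides. Some software reports z² instead, referred to a χ² distribution with 1 degree of freedom, which gives the same p-value.

```python
import math

def wald_test(b, se):
    """Wald z statistic and two-sided p-value for H0: beta = 0."""
    z = b / se
    # P(|Z| > |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimate and standard error (illustration only)
z, p_value = wald_test(0.7, 0.25)
print(round(z, 2), round(p_value, 4))
```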

12 Goodness of fit 1

One possibility to assess the fit of the estimated model is to use Pseudo R² values, which work similarly to the R² described for multiple regression analysis. In multiple linear regression, the R² is built on the sum of squared residuals, which is also the quantity minimized to obtain the estimates of the coefficients. Similarly, in logistic regression the Pseudo R² is built on the likelihood value. In particular, in logistic regression the fit is measured by the quantity -2LL (LL is the log-likelihood), which is positive and takes value zero in case of perfect fit (log 1 = 0); hence the closer -2LL is to zero, the better the fit.

13 The -2LL value can be used to compare different models. The idea is to compute -2LL for the rival (nested!) models and then choose the model with the lowest -2LL value. It is also possible to test the significance of the difference between the -2LL values computed for rival models (χ² test).

In order to produce an index based on -2LL that is readable as a Pseudo R² (something in the (0,1) interval), -2LL can be normalized by comparing the value obtained for the examined model with the value of a hypothetical null model (a very bad one, with only the intercept):

$$R^2_{\text{logit}} = \frac{-2LL_{\text{null}} - (-2LL_{\text{model}})}{-2LL_{\text{null}}}$$
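The computation of -2LL and of the Pseudo R² can be sketched as follows. The outcomes and the fitted probabilities `p_model` are hypothetical numbers chosen for illustration; in practice they would come from the ML fit. The null model assigns to every unit the sample share of ones, which is what an intercept-only logit fits.

```python
import math

def minus_two_ll(y, p):
    """-2 * log-likelihood of binary outcomes y given fitted probabilities p."""
    ll = sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
             for yi, pi in zip(y, p))
    return -2.0 * ll

y = [0, 0, 1, 0, 1, 0, 1, 1]
p_model = [0.10, 0.20, 0.40, 0.35, 0.60, 0.50, 0.80, 0.90]  # hypothetical fit
p_null = [sum(y) / len(y)] * len(y)  # intercept-only model: constant probability

m2ll_model = minus_two_ll(y, p_model)
m2ll_null = minus_two_ll(y, p_null)
pseudo_r2 = (m2ll_null - m2ll_model) / m2ll_null
print(round(m2ll_null, 3), round(m2ll_model, 3), round(pseudo_r2, 3))
```

The difference `m2ll_null - m2ll_model` is also the statistic of the χ² test mentioned above, with degrees of freedom equal to the number of additional coefficients, provided both sets of probabilities come from ML fits of nested models.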

14 The Akaike index is another useful tool to compare different models. The formula is:

$$AIC = -2LL + 2p$$

where p is the number of estimated coefficients. The preferred model is the one with the lowest AIC. Hence AIC not only considers the goodness of fit, but also penalizes overfitting, as it is an increasing function of the number of estimated parameters. The objective is to find the model that best explains the data with a minimum of free parameters. AIC can be used for all models estimated by maximum likelihood, not only for the logistic!
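A short sketch of the comparison, with hypothetical -2LL values for two rival models: the larger model fits slightly better but pays a higher penalty, so the smaller one is preferred here. The model names and numbers are made up for illustration.

```python
def aic(minus_two_ll, n_params):
    """Akaike index: -2LL plus twice the number of estimated coefficients."""
    return minus_two_ll + 2 * n_params

# Hypothetical rival models: (value of -2LL, number of coefficients)
models = {"small": (210.4, 3), "large": (208.9, 6)}

scores = {name: aic(m2ll, p) for name, (m2ll, p) in models.items()}
best = min(scores, key=scores.get)  # lowest AIC wins
print(scores, best)
```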

15 Goodness of fit 2

Another possibility to assess the goodness of fit is through the concept of predictive accuracy. The idea is to compare actual and fitted values of Y, interpreted in terms of membership of a certain class. A good indicator of the capability of the model to discriminate between groups (for example, those who buy the car and those who do not buy the car) is the fraction of zeros correctly predicted plus the fraction of ones correctly predicted. If the resulting sum exceeds 1, the model is of value.
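The indicator can be computed as in the sketch below. The classification rule (predict Y = 1 when the fitted probability is at least 0.5) is an assumption, since the slides do not state a cutoff, and the data are hypothetical.

```python
def hit_rate_sum(y, p, cutoff=0.5):
    """Fraction of zeros correctly predicted plus fraction of ones correctly predicted."""
    pred = [1 if pi >= cutoff else 0 for pi in p]
    pred_for_zeros = [pr for yi, pr in zip(y, pred) if yi == 0]
    pred_for_ones = [pr for yi, pr in zip(y, pred) if yi == 1]
    frac_zeros = sum(1 for pr in pred_for_zeros if pr == 0) / len(pred_for_zeros)
    frac_ones = sum(1 for pr in pred_for_ones if pr == 1) / len(pred_for_ones)
    return frac_zeros + frac_ones

y = [0, 0, 1, 0, 1, 0, 1, 1]
p = [0.10, 0.20, 0.40, 0.35, 0.60, 0.50, 0.80, 0.90]  # hypothetical fitted values
score = hit_rate_sum(y, p)
print(score, score > 1.0)
```

A model that assigns every unit to the same class scores exactly 1 on this indicator, which is why a useful model must exceed 1.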

16 Dummy variables

In logistic regression, too, dummy variables can be among the explanatory variables. Their exponentiated coefficients represent the relative level of the odds for the represented group, compared to the omitted group.

EXAMPLE: Suppose that Y=1 corresponds to an improvement in professional position. Among the explanatory variables there is gender (male-female), to which a dummy variable is associated (female=1, male=0); only this dummy is included in the model (remember multicollinearity!). Suppose 0.78 is the estimated exponentiated coefficient for that dummy variable. The analysis has to be done for females compared to males (the omitted dummy); hence the odds of an improvement for the female group are 0.78 times the odds for the male group. This means that the probability decreases if the person is female compared to a male person. In multiplicative terms, the odds are reduced by a factor of 0.78.
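Strictly speaking, an exponentiated logit coefficient multiplies the odds rather than the probability itself. The sketch below takes the 0.78 of the example and a hypothetical probability of 0.40 for the male group (this baseline is an assumption, not a figure from the slides) and converts the result back to a probability for the female group.

```python
def odds(p):
    return p / (1.0 - p)

def prob_from_odds(o):
    return o / (1.0 + o)

exp_coef = 0.78   # exponentiated coefficient of the female dummy (from the example)
p_male = 0.40     # hypothetical probability for the omitted (male) group

p_female = prob_from_odds(exp_coef * odds(p_male))
print(round(p_female, 4), round(p_female / p_male, 4))
```

The probability for the female group is lower than for the male group, as the example states, but the ratio of the two probabilities is not 0.78: only the ratio of the odds is.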

17 If the qualitative dependent variable has more than two choice categories (not binary, but polychotomous), the model presented in the previous slides has to be generalized. In this case the model is called multinomial logit or multinomial probit. For example, a commuter can choose among private car, bus and train. The result is a model composed of 3 equations (one for each alternative), where the probability of choosing exactly that alternative is expressed by means of a logit model.

The disadvantage of the multinomial logit model is that it assumes the independence of irrelevant alternatives property: the odds between any two alternatives remain constant even when another alternative is added to the consideration set. This is unrealistic, especially if two or more alternatives are close substitutes.
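The independence of irrelevant alternatives property can be seen directly from the multinomial logit formula, in which the probability of alternative j is e^{V_j} divided by the sum of e^{V_k} over the available alternatives. The sketch below uses hypothetical utility values V for the three transport modes of the example.

```python
import math

def choice_probabilities(utilities):
    """Multinomial logit probabilities from a dict of systematic utilities."""
    m = max(utilities.values())  # subtract the maximum for numerical stability
    expu = {k: math.exp(v - m) for k, v in utilities.items()}
    total = sum(expu.values())
    return {k: e / total for k, e in expu.items()}

# Hypothetical utilities for the commuter example
three = choice_probabilities({"car": 1.0, "bus": 0.2, "train": 0.5})
two = choice_probabilities({"car": 1.0, "bus": 0.2})  # train removed

# The car/bus odds are the same with or without the train
print(three["car"] / three["bus"], two["car"] / two["bus"])
```

If train and bus were close substitutes, one would expect the removal of the train to shift commuters mostly to the bus, changing the car/bus odds; the model rules this out by construction.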

18 GLM models

Generalized Linear Models are a class of statistical models that generalize the classical linear models and include logistic regression.

Special cases:
linear regression -> when the Y's are Gaussian variables.
logistic regression -> when the Y's are binary variables (binomial distribution).

GLMs are very useful to treat nonlinear models that can be expressed in a linear form!

NB: standard OLS inference does not hold. In the GLM context, inference is mainly based on the likelihood function.

19 The main characteristics of GLM models are:

1. The Y's belong to an extremely general family of distributions called the exponential family, which includes, among others, the well-known Gaussian, binomial, gamma, chi-square and exponential variables.
2. The model can be rendered linear by applying a suitable function (the link function) to the expected value of Y.

20 A real data analysis: Poverty in Italy in year 2006

Poverty is an extremely complex concept. In economics, many different definitions have been proposed: one of the most used is the relative poverty definition. Relative poverty means that a person is considered poor if his/her income is smaller than a certain threshold (for example the average income of the population). Note that poverty is treated as a binary attribute (poor / not poor) that derives from the dichotomization of a quantitative variable (income). Indeed, the poor (Y=1) are those whose income (I) is less than the fixed threshold; the not poor (Y=0) are those whose income is bigger than the fixed threshold. To model the POVERTY RISK it is possible to use a logistic regression.

21 The data set we use comes from the Italian Households Expenditure Survey (carried out by ISTAT). The sample (a probabilistic sample) includes more than households that give a representative image of the Italian population. Information on income is available only in terms of the income quartile to which the household belongs. Hence, total expenditure is used as a proxy for income. The poverty threshold is 60 percent of the median Italian expenditure (across all households). Let us call this threshold K. The generic household i is poor if E_i <= K (E is the expenditure, as a proxy for I) and not poor if E_i > K. So a dummy variable Y is created by dichotomizing the expenditure, to describe the relative poverty phenomenon.
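The dichotomization step can be sketched in a few lines. The expenditure figures below are invented for illustration and do not come from the ISTAT survey; the rule itself (threshold K equal to 60 percent of the median expenditure, Y = 1 when E_i <= K) follows the description above.

```python
import statistics

# Hypothetical monthly household expenditures (not survey data)
expenditure = [800.0, 1200.0, 1500.0, 2000.0, 2500.0, 3000.0, 850.0]

K = 0.6 * statistics.median(expenditure)  # relative poverty threshold
Y = [1 if e <= K else 0 for e in expenditure]  # 1 = poor, 0 = not poor

print(K, Y)
```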

22 The poverty risk (the probability of being poor: the probability that Y = 1) has been modeled using a logistic regression where the explanatory variables are:

1. Gender: male, female
2. Education: primary, middle, high, graduation
3. Family dimension: small (1-2), medium (3-4), big (more than 5)
4. Employment status: employed, unemployed, other (e.g. students, retired...)
5. Presence of old people: Yes, No (old means age>65)
6. Presence of children: Yes, No (children means age<14)
7. Geographic location: North, Center, South

23 Every qualitative variable has been translated into a set of dummy variables.

1. Gender:
D_m = 1 if the household head is a man, D_m = 0 otherwise
D_f = 1 if the household head is a woman, D_f = 0 otherwise

2. Education:
D_ep = 1 if the household head has a primary school license, D_ep = 0 otherwise
D_em = 1 if the household head has a middle school license, D_em = 0 otherwise
D_eh = 1 if the household head has a high school license, D_eh = 0 otherwise
D_eg = 1 if the household head has graduated, D_eg = 0 otherwise

24 3. Family dimension:
D_ds = 1 if the household has small dimension, D_ds = 0 otherwise
D_dm = 1 if the household has medium dimension, D_dm = 0 otherwise
D_db = 1 if the household has big dimension, D_db = 0 otherwise

4. Employment status:
D_ee = 1 if the household head is employed, D_ee = 0 otherwise
D_eu = 1 if the household head is unemployed, D_eu = 0 otherwise
D_eo = 1 if the household head is classified as other, D_eo = 0 otherwise

25 5. Presence of old people:
D_oy = 1 if there is an old person in the household, D_oy = 0 otherwise
D_on = 1 if there are no old people in the household, D_on = 0 otherwise

6. Presence of children:
D_cy = 1 if there is a child in the household, D_cy = 0 otherwise
D_cn = 1 if there are no children in the household, D_cn = 0 otherwise

26 7. Geographic location:
D_gn = 1 if the family lives in the North of Italy, D_gn = 0 otherwise
D_gc = 1 if the family lives in the Center of Italy, D_gc = 0 otherwise
D_gs = 1 if the family lives in the South of Italy, D_gs = 0 otherwise

27 In order to avoid multicollinearity, one variable has to be excluded from each set of dummies. We exclude the variable connected to the category that is most frequent in the population, in order to have a sort of benchmark family to which the comparative analysis is referred. The benchmark family is characterized by a male household head with a primary school license who is employed. The family is small, without old people and without children.

Can you write the model?
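The exclusion of one dummy per set can be sketched as follows. The category labels follow the lists in the slides, with the benchmark category placed first in each list; the dictionary `LEVELS`, the function `encode` and the column names are hypothetical. Geographic location is left out of the sketch because the slides do not state which area belongs to the benchmark family.

```python
# First entry of each list = benchmark category, whose dummy is omitted
LEVELS = {
    "gender": ["male", "female"],
    "education": ["primary", "middle", "high", "graduation"],
    "size": ["small", "medium", "big"],
    "employment": ["employed", "unemployed", "other"],
    "old": ["no", "yes"],
    "children": ["no", "yes"],
}

def encode(household):
    """Dummy-code one household, dropping the benchmark category of each variable."""
    row = {}
    for var, levels in LEVELS.items():
        for level in levels[1:]:  # skip the benchmark category
            row[f"{var}_{level}"] = 1 if household[var] == level else 0
    return row

benchmark = {"gender": "male", "education": "primary", "size": "small",
             "employment": "employed", "old": "no", "children": "no"}
other = dict(benchmark, gender="female", education="high")

print(encode(benchmark))
print(encode(other))
```

For the benchmark family every included dummy is zero, so its fitted logit reduces to the intercept, and each coefficient measures a departure from that family.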

LOGIT AND PROBIT ANALYSIS

LOGIT AND PROBIT ANALYSIS LOGIT AND PROBIT ANALYSIS A.K. Vasisht I.A.S.R.I., Library Avenue, New Delhi 110 012 amitvasisht@iasri.res.in In dummy regression variable models, it is assumed implicitly that the dependent variable Y

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

Logistic Regression. BUS 735: Business Decision Making and Research

Logistic Regression. BUS 735: Business Decision Making and Research Goals of this section 2/ 8 Specific goals: Learn how to conduct regression analysis with a dummy independent variable. Learning objectives: LO2: Be able to construct and use multiple regression models

More information

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

More information

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

More information

Regression with a Binary Dependent Variable

Regression with a Binary Dependent Variable Regression with a Binary Dependent Variable Chapter 9 Michael Ash CPPA Lecture 22 Course Notes Endgame Take-home final Distributed Friday 19 May Due Tuesday 23 May (Paper or emailed PDF ok; no Word, Excel,

More information

Introduction to Quantitative Methods

Introduction to Quantitative Methods Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Binary Logistic Regression

Binary Logistic Regression Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

Statistical tests for SPSS

Statistical tests for SPSS Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Multiple Choice Models II

Multiple Choice Models II Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

More information

Calculating the Probability of Returning a Loan with Binary Probability Models

Calculating the Probability of Returning a Loan with Binary Probability Models Calculating the Probability of Returning a Loan with Binary Probability Models Associate Professor PhD Julian VASILEV (e-mail: vasilev@ue-varna.bg) Varna University of Economics, Bulgaria ABSTRACT The

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. A General Formulation 3. Truncated Normal Hurdle Model 4. Lognormal

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Financial Vulnerability Index (IMPACT)

Financial Vulnerability Index (IMPACT) Household s Life Insurance Demand - a Multivariate Two Part Model Edward (Jed) W. Frees School of Business, University of Wisconsin-Madison July 30, 1 / 19 Outline 1 2 3 4 2 / 19 Objective To understand

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

Multinomial Logistic Regression

Multinomial Logistic Regression Multinomial Logistic Regression Dr. Jon Starkweather and Dr. Amanda Kay Moske Multinomial logistic regression is used to predict categorical placement in or the probability of category membership on a

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 7: Multiple regression analysis with qualitative information: Binary (or dummy) variables

Wooldridge, Introductory Econometrics, 4th ed. Chapter 7: Multiple regression analysis with qualitative information: Binary (or dummy) variables Wooldridge, Introductory Econometrics, 4th ed. Chapter 7: Multiple regression analysis with qualitative information: Binary (or dummy) variables We often consider relationships between observed outcomes

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

A Basic Introduction to Missing Data

A Basic Introduction to Missing Data John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

More information

1. Suppose that a score on a final exam depends upon attendance and unobserved factors that affect exam performance (such as student ability).

1. Suppose that a score on a final exam depends upon attendance and unobserved factors that affect exam performance (such as student ability). Examples of Questions on Regression Analysis: 1. Suppose that a score on a final exam depends upon attendance and unobserved factors that affect exam performance (such as student ability). Then,. When

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén

More information
