Advanced data analysis
M. Gerolimetto, Dip. di Statistica, Università Ca' Foscari Venezia
PART 2: LOGISTIC REGRESSION
Dichotomous dependent variables. When the dependent variable in a multiple regression model is qualitative and must be expressed by a dummy variable, special estimation problems arise. An example is the problem of explaining whether or not an individual will buy a car: the dependent variable Y takes value 1 if he/she buys the car and 0 if he/she does not. The explanatory variables can be qualitative variables represented by dummies (for example, the characteristics of the car) as well as quantitative variables (for example, the price of the car or the income of the person). The predicted values of the dependent variable will fall mainly in the interval [0,1], so those values can be interpreted as the probability that the individual buys the car, given the characteristics, the income, etc. Approximating this relationship by a straight line produces a very bad fit.
In order to avoid having probabilities outside the [0,1] range, a NONLINEAR MODEL is used instead of a linear one; it works by squeezing the probabilities into the (0,1) range. So, in place of a linear function (as in multiple linear regression), a nonlinear function is used. The most common nonlinear functions, very appropriate for this framework, are the LOGISTIC FUNCTION and the CUMULATIVE NORMAL FUNCTION.
Linear function -> usual multiple regression model: Y = Xβ + u
Logistic function -> logistic regression
Cumulative normal function -> probit regression
In both cases the model is P(Y = 1) = f(Xβ). Whether it is called logistic or probit regression depends on the choice of f(), called the link function.
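As a sketch, the two link functions can be computed with the standard library; the z values below are purely illustrative:

```python
import math

def logistic(z):
    """Logistic link: e^z / (1 + e^z), written in the equivalent
    numerically stabler form 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Cumulative normal link, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both links squeeze the linear predictor Xβ into (0, 1).
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  logistic={logistic(z):.3f}  probit={probit(z):.3f}")
```

Both functions are increasing and map 0 to probability 0.5; the probit curve is slightly steeper around the center, but in practice the two links give very similar fits.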
A hidden error term. In logistic (probit) regression the error term does not appear explicitly in the model; it is implicit. This formulation comes from the random utility model, which states that an individual buys a certain product (Y = 1) if the utility connected to it is bigger than a threshold, or is the highest among the products in the same consideration set. This utility is expressed by a linear function that includes an error term. So the expression P(Y = 1) is equivalent to P(U > 0), where U = Xβ + u. Hence P(Y = 1) = P(Xβ + u > 0), and it is modeled by f(Xβ). The error term is implicit in writing P(Y = 1), so it does not appear explicitly in the logit or probit model. NB: the error term is always in the model even though it is not evident!
Violating the assumptions of multiple regression. The binary nature of the dependent variable leads to the violation of some assumptions of the multiple regression model:
1. The Y's are not normal; they follow a binomial distribution instead. The inference of the multiple regression model, based on the assumption of normality, loses its validity here.
2. The variance of a binomial variable is not constant, so the hypothesis of homoskedasticity is also violated.
3. The conditional expectation E(Y|X) is fitted by a nonlinear function.
Logistic regression (and the probit model as well) deals with these problems by introducing a different approach to estimating the coefficients, interpreting them, and assessing the goodness of fit of the model (no more ordinary least squares estimation, no more R²).
Inference. When estimating the coefficients in the logistic regression context, standard OLS inference cannot be used. One possibility is to estimate the coefficients by maximum likelihood (ML). The maximum likelihood procedure requires the maximization of the likelihood function; this is equivalent to obtaining the estimates of the coefficients that best accord with the empirical evidence given by the sample. Note that maximum likelihood estimates are often obtained with an efficient numerical algorithm called Iteratively Reweighted Least Squares (IRLS).
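A minimal sketch of IRLS for the logistic model, written with NumPy; the simulated data, sample size, and "true" coefficients below are assumptions for illustration, not from the slides:

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Maximum likelihood for logistic regression via Iteratively
    Reweighted Least Squares (equivalently, Newton-Raphson)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = p * (1.0 - p)                     # binomial variances = weights
        # Weighted least-squares step: solve (X'WX) delta = X'(y - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

# Hypothetical toy data: P(Y=1) rises with x.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones(500), x])
true_beta = np.array([-0.5, 1.2])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
b = logistic_irls(X, y)
print("ML estimates:", b)
```

Each iteration is an ordinary weighted least-squares fit whose weights p(1 − p) are recomputed from the current probabilities, which is where the name comes from.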
The logistic function. Suppose we have a sample of n units that give observations for a dichotomous variable Y and a quantitative variable X, and we want to explain Y with X using a logistic regression. The link function is then the logistic function (usually indicated by g), which for a generic variable z is:
g(z) = e^z / (1 + e^z)
In the logistic regression context, this function is exploited as follows:
P(Y = 1) = e^(Xβ) / (1 + e^(Xβ))        P(Y = 0) = 1 − P(Y = 1) = 1 / (1 + e^(Xβ))
Once the estimate b of the parameter β is obtained, the fitted values are:
P̂(Y = 1) = e^(Xb) / (1 + e^(Xb))        P̂(Y = 0) = 1 / (1 + e^(Xb))
The coefficients. Because of the nonlinear functional form, the marginal effect of an explanatory variable on the dependent variable is not given by that variable's coefficient, but by an appropriate function of the coefficient. One way to interpret the coefficients is to consider the expression:
log [ P(Y = 1) / P(Y = 0) ] = log [ (e^(Xβ) / (1 + e^(Xβ))) / (1 / (1 + e^(Xβ))) ] = Xβ
The logarithm of the ratio between the probability that the event occurs and the probability that it does not (this ratio is called the ODDS) is called the LOGIT transform. Hence each coefficient β has to be read as the variation of the logit following a unit variation in the explanatory variable.
Odds ratio. Another way to interpret the coefficients is to consider only the ratio between the probability that the event occurs and the probability that it does not. This ratio (the odds) is:
P(Y = 1) / P(Y = 0) = (e^(Xβ) / (1 + e^(Xβ))) / (1 / (1 + e^(Xβ))) = e^(Xβ)
In this case the exponentiated coefficients reflect changes in the odds following a unit variation in the explanatory variable. The coefficients β are particularly useful to determine the sign of the relationship: a positive coefficient indicates that a unit increase in X is connected with an increase in the predicted probability, and vice versa. The exponentiated coefficients e^β are particularly useful to express the magnitude of the relationship: the impact is multiplicative, which means we know how much bigger (or smaller) P(Y=1) becomes compared to P(Y=0).
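A small numeric illustration of the multiplicative effect; the coefficient value −0.7 is a hypothetical estimate, not taken from the slides:

```python
import math

# Hypothetical estimated coefficient for a price variable.
b_price = -0.7

# A unit increase in price adds b_price to the logit (log-odds)...
# ...and therefore multiplies the odds P(Y=1)/P(Y=0) by exp(b_price).
odds_multiplier = math.exp(b_price)
print(f"each extra unit of price multiplies the odds by {odds_multiplier:.3f}")

# Effect on an actual probability: start from P(Y=1) = 0.5 (odds = 1).
p0 = 0.5
odds1 = (p0 / (1 - p0)) * odds_multiplier
p1 = odds1 / (1 + odds1)
print(f"P(Y=1) moves from {p0:.2f} to {p1:.3f}")
```

Note that the effect on the probability itself is not constant: the same odds multiplier moves P(Y=1) by different amounts depending on the starting probability.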
Significance of the coefficients. In logistic regression the null hypothesis that a coefficient is not significant is tested similarly to what is done in linear regression. If we consider the β coefficients (i.e. the logit as the dependent variable), a zero coefficient means that the variable has no impact. To confirm this, think of the e^β coefficients (i.e. the odds as the dependent variable): when β is zero, e^β = 1, and then P(Y=1)/P(Y=0) = 1. If the odds are equal to 1, then P(Y = 1) = P(Y = 0), hence there is no way this explanatory variable is useful in making predictions! The hypothesis testing is done with the Wald test (instead of the t test), because the estimation method is not standard OLS.
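A hedged sketch of the Wald test computation; the estimate and standard error below are made-up illustrative numbers:

```python
import math

def wald_test(b, se):
    """Wald statistic z = b/se and two-sided p-value for H0: beta = 0,
    using the standard normal reference distribution."""
    z = b / se
    # P(|Z| > |z|) = 2 * (1 - Phi(|z|)), with Phi built from erf
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return z, 2.0 * (1.0 - phi)

# Hypothetical coefficient estimate and standard error.
z, p = wald_test(b=0.84, se=0.31)
print(f"z = {z:.2f}, p-value = {p:.4f}")  # reject H0 at the 5% level here
```

Some software reports the squared statistic (b/se)² instead, compared against a chi-square distribution with 1 degree of freedom; the two versions give the same p-value.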
Goodness of fit 1. One possibility to assess model fit is to use pseudo-R² values, which work similarly to the R² described for multiple regression analysis. In multiple linear regression, R² is built on the basis of the sum of squared residuals, which is also the quantity minimized to obtain the coefficient estimates. Similarly, in logistic regression the pseudo-R² is built on the likelihood value. In particular, model fit is measured with the quantity −2LL (LL is the log-likelihood), which is positive and takes value zero in case of perfect fit (since log 1 = 0); hence the closer −2LL is to zero, the better the fit.
The −2LL value can be used to compare different models. The idea is to compute −2LL for the rival (nested!) models and then choose the model with the lowest −2LL value. It is also possible to test the significance of the difference between the −2LL values computed for rival models (χ² test). In order to produce an index based on −2LL that is readable as a pseudo-R² (something in the (0,1) interval), −2LL can be normalized by comparing the value obtained for the examined model with the value for a hypothetical null model (a very bad one, with only the intercept):
R²_logit = [−2LL_null − (−2LL_model)] / (−2LL_null)
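The normalized index and the nested-model comparison can be sketched as follows; the two log-likelihood values are hypothetical:

```python
# Hypothetical log-likelihoods (assumptions, for illustration only):
LL_null = -350.0    # intercept-only model
LL_model = -290.0   # model with predictors

# Pseudo R^2 built from the -2LL values, as in the formula above
pseudo_r2 = (-2 * LL_null - (-2 * LL_model)) / (-2 * LL_null)
print(f"pseudo R^2 = {pseudo_r2:.3f}")

# Difference of the -2LL values: the likelihood-ratio statistic,
# compared against a chi-square with df = number of added coefficients.
lr_stat = -2 * LL_null - (-2 * LL_model)
print(f"LR chi-square statistic = {lr_stat:.1f}")
```

The index is 0 when the predictors add nothing (LL_model = LL_null) and approaches 1 as the fitted model approaches a perfect fit (LL_model → 0).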
The Akaike index is another useful tool to compare different models. The formula is:
AIC = −2LL + 2p
where p is the number of estimated coefficients. The preferred model is the one with the lowest AIC. Hence AIC not only considers the goodness of fit, but also penalizes overfitting, as it is an increasing function of the number of estimated parameters. The objective is to find the model that best explains the data with a minimum of free parameters. AIC can be used for all models, not only the logistic!
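A minimal illustration of the trade-off AIC encodes; the −2LL values and parameter counts below are assumptions:

```python
def aic(minus2ll, p):
    """Akaike information criterion: -2LL plus a penalty of 2 per
    estimated coefficient."""
    return minus2ll + 2 * p

# Two hypothetical rival models:
aic_small = aic(minus2ll=592.0, p=4)   # fewer predictors, slightly worse fit
aic_large = aic(minus2ll=580.0, p=12)  # more predictors, better raw fit
print(aic_small, aic_large)            # prefer the model with the lower AIC
```

Here the larger model fits better in raw −2LL terms, yet the penalty for its extra parameters makes the smaller model the preferred one.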
Goodness of fit 2. Another possibility to assess the goodness of fit is through the concept of predictive accuracy. The idea is to compare actual and fitted values of Y, interpreted in terms of membership in a certain class. A good indicator of the capability of the model to discriminate between groups (for example, those who buy the car versus those who do not buy the car) is the fraction of zeros correctly predicted plus the fraction of ones correctly predicted. If the resulting sum exceeds 1, the model is of value.
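The indicator described above can be sketched as follows; the actual outcomes and fitted probabilities are invented for illustration, and the 0.5 classification cutoff is an assumption:

```python
def class_accuracy(y, p_hat, cutoff=0.5):
    """Fraction of ones and fraction of zeros correctly predicted when
    P(Y=1) >= cutoff is classified as Y=1."""
    yhat = [1 if p >= cutoff else 0 for p in p_hat]
    pred_for_ones = [h for h, a in zip(yhat, y) if a == 1]
    pred_for_zeros = [h for h, a in zip(yhat, y) if a == 0]
    frac_ones = sum(pred_for_ones) / len(pred_for_ones)         # sensitivity
    frac_zeros = pred_for_zeros.count(0) / len(pred_for_zeros)  # specificity
    return frac_ones, frac_zeros

# Hypothetical actual outcomes and fitted probabilities:
y     = [1, 1, 1, 0, 0, 0, 1, 0]
p_hat = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3]
s1, s0 = class_accuracy(y, p_hat)
print(f"ones correct: {s1:.2f}, zeros correct: {s0:.2f}, sum: {s1 + s0:.2f}")
```

A model with no discriminating power gives a sum of about 1 (whatever it gains on one class it loses on the other), which is why a sum above 1 signals that the model is of value.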
Dummy variables. In logistic regression as well, dummy variables can be among the explanatory variables. Focusing on exponentiated coefficients, they represent the relative level of the dependent variable for the represented group compared to the omitted group. EXAMPLE: Suppose that Y=1 corresponds to an improvement in professional position. Among the explanatory variables there is gender (male/female), to which a dummy variable is associated (female=1, male=0) and included in the model (remember multicollinearity!). Suppose 0.78 is the estimated exponentiated coefficient for that dummy variable. The analysis has to be done for female compared to male (the omitted dummy): the odds of Y=1 for the female group are 0.78 times the odds for the male group. This means that the probability decreases if the person is female rather than male; in multiplicative terms, the amount of this reduction of the odds is 0.78.
If the qualitative dependent variable has more than two categories (not binary, but polychotomous), the model presented in the previous slides has to be generalized. In this case the model is called multinomial logit or multinomial probit. For example, a commuter can choose among private car, bus, and train. The result is a model composed of 3 equations (one for each alternative), where the probability of choosing each alternative is expressed by means of a logit model. The disadvantage of this model is that it assumes the independence of irrelevant alternatives (IIA) property, which means that the odds ratios stay constant even when another alternative is included in the consideration set. This is unrealistic, especially if two or more alternatives are close substitutes.
GLM models. Generalized Linear Models are a class of statistical models that generalize the classical linear models and include logistic regression. Special cases: linear regression, when the Y's are Gaussian variables; logistic regression, when the Y's are binary variables (binomial distribution). GLMs are very useful to treat nonlinear models that can be expressed in a linear form! NB: standard OLS inference does not hold; in the GLM context inference is mainly based on the likelihood function.
The main characteristics of GLM models are:
1. The Y's belong to an extremely general family of distributions called the exponential family, which includes, among others, the well-known Gaussian, binomial, gamma, chi-square, and exponential variables.
2. The model can be rendered linear by taking an appropriate function (the link).
A real data analysis: poverty in Italy in year 2006. Poverty is an extremely complex concept. In economics, many different definitions have been proposed; one of the most used is the relative poverty definition. Relative poverty means that a person is considered poor if his/her income is smaller than a certain threshold (for example, the average income of the population). Note that poverty is treated as a binary attribute (poor/not poor) that derives from the dichotomization of a quantitative variable (income). Indeed, the poor (Y=1) are those whose income I is less than the fixed threshold; the not poor (Y=0) are those whose income is bigger than the fixed threshold. To model the POVERTY RISK it is possible to use a logistic regression.
The data set we use comes from the Italian Households Expenditure Survey (carried out by ISTAT). The sample (a probabilistic sample) includes households that give a representative image of the Italian population. Information on income is available only in terms of the income quartile to which the household belongs; hence total expenditure is used as a proxy for income. The poverty threshold is 60 percent of the median Italian expenditure (across all households). Call this threshold K. The generic household i is poor if E_i <= K (E is expenditure, as a proxy of income), and not poor if E_i > K. So a dummy variable Y is created by dichotomizing expenditure, to describe the relative poverty phenomenon.
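The dichotomization step can be sketched as follows; the expenditure figures are invented, since the actual survey data are not reproduced here:

```python
# Hypothetical household expenditures (the real data come from ISTAT).
expenditure = [850.0, 1200.0, 430.0, 2100.0, 640.0, 990.0, 1500.0]

# Median expenditure across all households
sorted_e = sorted(expenditure)
n = len(sorted_e)
median = sorted_e[n // 2] if n % 2 else 0.5 * (sorted_e[n // 2 - 1] + sorted_e[n // 2])

# Relative poverty threshold K = 60% of the median
K = 0.60 * median

# Dummy dependent variable: poor (E_i <= K) = 1, not poor = 0
y = [1 if e <= K else 0 for e in expenditure]
print(f"K = {K:.0f}", y)
```

The resulting 0/1 vector y is then the dependent variable of the logistic regression described in the next slides.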
The poverty risk (the probability of being poor, i.e. the probability that Y = 1) has been modeled using a logistic regression where the explanatory variables are:
1. Gender: male, female
2. Education: primary, middle, high, graduation
3. Family dimension: small (1-2), medium (3-4), big (5 or more)
4. Employment status: employed, unemployed, other (e.g. students, retired...)
5. Presence of old people: yes, no (old means age > 65)
6. Presence of children: yes, no (children means age < 14)
7. Geographic location: North, Center, South
Every qualitative variable has been translated into a set of dummy variables.
1. Gender:
D_m = 1 if the household head is a man, D_m = 0 otherwise
D_f = 1 if the household head is a woman, D_f = 0 otherwise
2. Education:
D_ep = 1 if the household head has a primary school license, D_ep = 0 otherwise
D_em = 1 if the household head has a middle school license, D_em = 0 otherwise
D_eh = 1 if the household head has a high school license, D_eh = 0 otherwise
D_eg = 1 if the household head has graduated, D_eg = 0 otherwise
3. Family dimension:
D_ds = 1 if the household has small dimension, D_ds = 0 otherwise
D_dm = 1 if the household has medium dimension, D_dm = 0 otherwise
D_db = 1 if the household has big dimension, D_db = 0 otherwise
4. Employment status:
D_ee = 1 if the household head is employed, D_ee = 0 otherwise
D_eu = 1 if the household head is unemployed, D_eu = 0 otherwise
D_eo = 1 if the household head is classified as other, D_eo = 0 otherwise
5. Presence of old people:
D_oy = 1 if there is an old person in the household, D_oy = 0 otherwise
D_on = 1 if there are no old people in the household, D_on = 0 otherwise
6. Presence of children:
D_cy = 1 if there is a child in the household, D_cy = 0 otherwise
D_cn = 1 if there are no children in the household, D_cn = 0 otherwise
7. Geographic location:
D_gn = 1 if the family lives in the North of Italy, D_gn = 0 otherwise
D_gc = 1 if the family lives in the Center of Italy, D_gc = 0 otherwise
D_gs = 1 if the family lives in the South of Italy, D_gs = 0 otherwise
In order to avoid multicollinearity, from each dummy set one variable has to be excluded. We exclude the variable connected to the category that is most frequent in the population, in order to have a sort of benchmark family to which the comparative analysis refers. The benchmark family is characterized by a male household head with a primary school license who is employed; the family dimension is small, without old people and without children. Can you write the model?
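As a sketch of one possible answer (assuming North is also the omitted geographic category, which the slide does not state), the model with the benchmark dummies D_m, D_ep, D_ds, D_ee, D_on, D_cn, D_gn excluded would be:

```latex
\log\frac{P(Y=1)}{P(Y=0)} = \beta_0
  + \beta_1 D_f
  + \beta_2 D_{em} + \beta_3 D_{eh} + \beta_4 D_{eg}
  + \beta_5 D_{dm} + \beta_6 D_{db}
  + \beta_7 D_{eu} + \beta_8 D_{eo}
  + \beta_9 D_{oy} + \beta_{10} D_{cy}
  + \beta_{11} D_{gc} + \beta_{12} D_{gs}
```

The intercept β₀ then captures the poverty risk (on the logit scale) of the benchmark family, and every other coefficient is a contrast with that benchmark.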
More informationSUGI 29 Statistics and Data Analysis
Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,
More information5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationCHAPTER 14 ORDINAL MEASURES OF CORRELATION: SPEARMAN'S RHO AND GAMMA
CHAPTER 14 ORDINAL MEASURES OF CORRELATION: SPEARMAN'S RHO AND GAMMA Chapter 13 introduced the concept of correlation statistics and explained the use of Pearson's Correlation Coefficient when working
More informationPractical. I conometrics. data collection, analysis, and application. Christiana E. Hilmer. Michael J. Hilmer San Diego State University
Practical I conometrics data collection, analysis, and application Christiana E. Hilmer Michael J. Hilmer San Diego State University Mi Table of Contents PART ONE THE BASICS 1 Chapter 1 An Introduction
More informationFairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
More informationUSING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA
USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique
More informationRegression 3: Logistic Regression
Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing
More informationAssociation Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
More informationA model to predict churn
A model to predict churn Hilda Cecilia Lindvall April 18, 2014 Abstract This Master Thesis has been performed at Svenska Spel with the aim to detect playing customers probability to churn, i.e. quit their
More informationModel Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.
Paper 264-26 Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc. Abstract: There are several procedures in the SAS System for statistical modeling. Most statisticians who use the SAS
More informationEnd User Satisfaction With a Food Manufacturing ERP
Applied Mathematical Sciences, Vol. 8, 2014, no. 24, 1187-1192 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2014.4284 End-User Satisfaction in ERP System: Application of Logit Modeling Hashem
More information1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number
1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationRegression III: Advanced Methods
Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models
More informationNew Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationLOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
More informationSP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY
SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT The purpose of this paper is to investigate several SAS procedures that are used in
More informationPS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
More informationExample: Boats and Manatees
Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant
More informationI L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationAnswer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade
Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements
More informationUnit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
More informationWhat s New in Econometrics? Lecture 8 Cluster and Stratified Sampling
What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and
More informationLogistic regression (with R)
Logistic regression (with R) Christopher Manning 4 November 2007 1 Theory We can transform the output of a linear regression to be suitable for probabilities by using a logit link function on the lhs as
More informationStandard errors of marginal effects in the heteroskedastic probit model
Standard errors of marginal effects in the heteroskedastic probit model Thomas Cornelißen Discussion Paper No. 320 August 2005 ISSN: 0949 9962 Abstract In non-linear regression models, such as the heteroskedastic
More informationLean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY
TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY Before we begin: Turn on the sound on your computer. There is audio to accompany this presentation. Audio will accompany most of the online
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More information