SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

ABSTRACT

The purpose of this paper is to investigate several SAS procedures that are used for linear predictive models in SAS/STAT. The primary focus will be on the correct choice of model given the designated outcome variable and the combination of input variables. Procedures to be discussed include GLM, LOGISTIC, GENMOD, MIXED, and GLIMMIX. PROC GLIMMIX is a relatively new SAS procedure, although it has been available as a macro for some time.

There are three main types of variables used in linear models: nominal, ordinal, and interval. Nominal data are categorical (such as gender); ordinal data are categorical and can be ordered from least to most (such as employee evaluation rank); interval data can define ratios. While all of the models discussed can include all three types of input variables, the model choice is different if the outcome variable is interval or nominal.

Another consideration for model choice is whether the input variables are fixed effects or random effects. Fixed effects are definitive, and will not change regardless of the sample data collection. Random effects can change when the experiment is replicated. Examples of random effects include subjects in a drug study, the choice of items to compare between retail stores for market basket price differences, and classrooms in an education study. Examples will be discussed.

INTRODUCTION

An inappropriate model will provide inappropriate results. For those users of SAS who know SAS/STAT and PROC GLM, there are other models that may be more appropriate to the collected data. It is necessary to fit the model to the data, not the data to the model (to a man with a hammer, everything looks like a nail). If regression is not appropriate because the assumptions are violated, change the model. There are several models readily available in SAS/STAT (Figure 1).

Figure 1. Linear Models Available in SAS/STAT

   Generalized Linear Mixed Model    PROC GLIMMIX
   Linear Mixed Model                PROC MIXED
   Generalized Linear Model          PROC GENMOD
   General Linear Model              PROC GLM
   ANOVA                             PROC ANOVA
   Regression                        PROC REG
   Logistic Regression               PROC LOGISTIC

Each model serves a different purpose, and should be used with different types of data. The purpose of this paper is to focus on model choice; it is not intended to provide all details concerning the use of each model. Once the investigator has chosen one of the models, details are available in the online documentation. Items that must be considered in model choice are

1. Type of outcome variable: whether nominal, ordinal, or interval
2. Type of input variable: whether nominal, ordinal, or interval
3. Type of input variable: whether fixed or random effect
4. Choice of covariance matrix format for random effects
5. Choice of link function for non-normal residuals

As the complexity of the data increases, so, too, does the complexity of the model. Choices must be made, and those choices affect model outcomes. Consider Table 1, which gives some indication as to how the models should be used.

Table 1. Outline of Model Choice

   Model      Output Variable         Types of Inputs                              Assumptions
   ANOVA      Interval                Categorical, fixed effects only              Normality
   REG        Interval                Interval, fixed effects only                 Normality
   LOGISTIC   Binary                  Categorical, interval, fixed effects only    Log-Normal
   GLM        Interval                Categorical, interval, fixed effects only    Normality
   GENMOD     Categorical, interval   Categorical, interval, fixed effects only    Exponential family
   MIXED      Interval                Categorical, interval, random effects        Normality
   GLIMMIX    Categorical, interval   Categorical, interval, random effects        Exponential family

This paper will discuss the different models, and how to define outcomes and inputs, along with a consideration of the assumptions as listed in Table 1.

PROC ANOVA and PROC REG

ANOVA should only be used for a balanced design, in which every level of each categorical input has the same number of observations. If there are three treatments, then each treatment should have exactly the same number of observations. This procedure requires less computing time compared to PROC GLM. However, since a completely balanced design almost never happens with large samples, there is really no need to use ANOVA instead of GLM.

PROC REG can only use interval or ordinal variables as inputs. In order to include nominal data, dummy variables need to be created, and many nominal inputs require considerable programming effort. Essentially, for each level of a nominal variable, PROC REG creates a new regression line that is parallel to the regression lines for all other levels of the same variable.
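The dummy-variable point can be made concrete with a small sketch. The dataset work.trial and its variables (treatment, dose, response) are hypothetical and are not taken from the paper; the data values are invented purely so that the step runs.

```sas
/* Hypothetical data: response and dose are interval, treatment is nominal (A, B, C) */
data work.trial;
   input treatment $ dose response;
   /* Dummy variables coded by hand; level C is the reference level */
   trtA = (treatment = 'A');
   trtB = (treatment = 'B');
   datalines;
A 1 10.2
A 2 12.1
A 3 13.9
B 1 11.8
B 2 13.7
B 3 16.1
C 1 9.1
C 2 10.8
C 3 13.2
;
run;

/* PROC REG: the nominal input can only enter through the dummy variables */
proc reg data=work.trial;
   model response = dose trtA trtB;
run;
quit;

/* PROC GLM: the CLASS statement builds the same parallel-lines model */
proc glm data=work.trial;
   class treatment;
   model response = dose treatment / solution;
run;
quit;
```

Both steps fit one common slope for dose and a separate intercept per treatment level, which is the "parallel regression lines" structure described above. With the default PROC GLM coding the last level (C here) serves as the reference, so the two sets of estimates should agree.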
While PROC REG has diagnostics that are of value, the same diagnostics have now been incorporated into PROC GLM. For this reason, it is better to use PROC GLM for all standard analyses.

PROC GLM

In the past, PROC GLM was the most sophisticated procedure for performing a linear models analysis. It can use both interval and categorical variables as inputs, it now contains all of the diagnostic elements provided by PROC REG, and it does not require a balanced design. In addition, PROC GLM uses the Type III Sum of Squares to examine multiple types of treatments simultaneously.

The one problem with PROC GLM is that it was never intended to be used with random effects. Special cases of random effects, such as nested designs and split plot designs, have been developed for use with PROC GLM. Repeated measures can also be examined using PROC GLM, provided that there are few subjects dropping out in the later time measurements. Nevertheless, PROC GLM has become the default model of choice, and very little consideration is usually given to whether the inputs are fixed or random effects.

Repeated measures represent a random effect since the choice of time points to collect measurements is somewhat arbitrary on the part of the investigator. Inputs such as age that are divided into blocks are also random effects since the blocks are arbitrary. For the same reason, Likert scales are random effects since it is somewhat arbitrary whether a 4-point or a 5-point scale is used. In many cases, however, these inputs are entered into PROC GLM as if they were fixed effects. As is true in the special cases of split plots and nested effects, assuming the effects are fixed when they are random will increase the size of the random error. That will decrease the overall size of the F-statistics. As a result, the model will have non-significant F-statistics that should be significant.
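As a rough illustration of treating an input as random inside PROC GLM, the sketch below uses the RANDOM statement with the TEST option, which asks PROC GLM to build F-tests from the expected mean squares instead of testing every effect against the residual. The dataset work.classes, the variables method, classroom, and score, and all data values are hypothetical.

```sas
/* Hypothetical data: two teaching methods, three classrooms, two students per cell */
data work.classes;
   input method $ classroom $ score;
   datalines;
lecture c1 71
lecture c1 75
lecture c2 80
lecture c2 78
lecture c3 66
lecture c3 70
online  c1 74
online  c1 79
online  c2 77
online  c2 82
online  c3 73
online  c3 69
;
run;

proc glm data=work.classes;
   class method classroom;
   model score = method classroom method*classroom;
   /* Declare classroom and its interaction with method as random;   */
   /* TEST requests F-tests based on the appropriate error terms     */
   random classroom method*classroom / test;
run;
quit;
```

Comparing the default Type III table with the tests produced by the RANDOM statement shows how the choice of error term changes the F-statistic for method. This is only a sketch of the special-case handling described above, not a replacement for a mixed-model analysis.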
Consider the following question: should ordinal variables be defined as quantitative or as classification variables in PROC GLM? Since ANOVA assumes class levels (i.e., nominal data) and regression assumes interval data, there is no real provision for ordinal variables. If an ordinal input is defined as a class variable, many degrees of freedom will be used, but post-hoc tests can be made. If it is defined as interval, only one degree of freedom is used in the model, but post-hoc tests are unavailable. Depending on the choice, model results can differ. Sample GLM code is listed below:

   PROC GLM DATA=WORK.SORT7659;
      CLASS CourseLevel expectknownever;
      MODEL hours = CourseLevel expectknownever / SS3 SOLUTION SINGULAR=1E-07;
      LSMEANS CourseLevel / PDIFF=ALL;
      LSMEANS CourseLevel expectknownever / PDIFF=ALL;
   RUN;
   QUIT;

PROC LOGISTIC

PROC LOGISTIC is very similar to PROC GLM, although it has a binary outcome variable rather than an interval outcome. If the outcome is ordinal, PROC LOGISTIC can also be used, but with a complementary log-log link function instead of the more standard logit function. Both PROC LOGISTIC and PROC GLM can place ordinal inputs either as class or as quantitative variables. Again, the degrees of freedom and the necessity of post-hoc tests should be considered before deciding where to place the ordinal inputs.

Frequently, logistic regression is used to divide a population into high risk/low risk. However, this dichotomous outcome is contrived. There could just as easily be 5 or 10 categories of risk. It is not necessary to reduce the number of outcomes to 2 just to fit the results into a logistic model.

Logistic regression also defines odds ratios for the input variables. However, the default does not provide confidence limits for them. Therefore, the user should always use the option to print confidence limits. In addition, the user should examine the c-statistic, which is comparable to R^2 for the general linear model. If the outcome variable only has two levels, logistic regression can also print a classification table and a receiver operating characteristic (ROC) curve. They can be used to define a cut-point to divide the population into the high/low categories.
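The confidence limits for odds ratios mentioned above can be written out explicitly. The block below is a sketch of the usual Wald construction (the kind requested by an option such as CLODDS=WALD). The symbols are notation introduced here and do not appear in the paper: \hat{\beta}_j is the estimated logit coefficient of input j, SE is its standard error, and z is the standard normal quantile.

```latex
% Odds ratio for a one-unit increase in input x_j under the logit link
\widehat{OR}_j = \exp\bigl(\hat{\beta}_j\bigr)

% Wald 100(1-\alpha)\% confidence limits for that odds ratio
\Bigl[\,\exp\bigl(\hat{\beta}_j - z_{1-\alpha/2}\,\mathrm{SE}(\hat{\beta}_j)\bigr),\;
      \exp\bigl(\hat{\beta}_j + z_{1-\alpha/2}\,\mathrm{SE}(\hat{\beta}_j)\bigr)\Bigr]
```

An interval that contains 1 indicates that, at level \alpha, the input cannot be shown to change the odds of the outcome, which is why printing the limits matters more than the point estimate alone.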
Standard code is given below:

   PROC LOGISTIC DATA=WORK.SORT7975;
      CLASS BS (PARAM=EFFECT) workhabits (PARAM=EFFECT);
      MODEL CourseLevel = BS workhabits hours /
            SELECTION=NONE LINK=PROBIT CLPARM=WALD CLODDS=WALD ALPHA=0.05;
      OUTPUT OUT=SASUSER.PRED3492(LABEL="Logistic regression predictions and statistics for SASUSER.QURY0181")
             PREDPROBS=INDIVIDUAL;
   RUN;

For ordinal outcomes (or nominal outcomes with more than 2 levels), the code used is

   PROC LOGISTIC DATA=WORK.SORT1118;
      CLASS BS (PARAM=EFFECT) workhabits (PARAM=EFFECT);
      MODEL CourseLevel = BS workhabits /
            SELECTION=NONE LINK=CLOGLOG CLPARM=WALD CLODDS=WALD ALPHA=0.05;
      OUTPUT OUT=SASUSER.PRED1881(LABEL="Logistic regression predictions and statistics for SASUSER.QURY0181")
             PREDPROBS=INDIVIDUAL;
   RUN;

There are some cautions in order concerning logistic regression. Logistic regression will ALWAYS inflate results, especially if the group sizes are very different and one of the groups represents a rare event. For example, if one group is 95% of the sample and the other is 5%, then a single classification rule (put all subjects in class A) will be 95% accurate.
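The 95%/5% arithmetic can be checked with a few lines of DATA step code. The counts are hypothetical (1,000 subjects split 950/50) and the variable names are made up for this sketch.

```sas
/* Hypothetical counts: 1,000 subjects, 950 in class A and 50 in class B */
data _null_;
   n_a = 950;
   n_b = 50;
   /* Rule "put every subject in class A": every A is right, every B is wrong */
   correct  = n_a;
   accuracy = correct / (n_a + n_b);   /* overall share classified correctly   */
   found_b  = 0 / n_b;                 /* share of the rare class B identified */
   put accuracy= found_b=;
run;
```

The overall accuracy looks high even though the rule never identifies a single member of the rare group, which is the sense in which an accuracy figure can be inflated when group sizes are very unequal.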

Poisson regression should be used for rare events instead. If possible, fresh data should be used to examine the inflation rate of results.

PROC MIXED

The PROC MIXED model has two components, a fixed part and a random part: $y = \alpha x + \gamma z + \varepsilon$. If $\gamma = 0$, then the mixed model is identical to the general linear model. If $\gamma \neq 0$, then there is some randomness in the model and some covariance between inputs. Special cases of the mixed model are repeated measures, nested designs, and split plot designs. Before the introduction of PROC MIXED, these three special cases were considered using PROC GLM, but with some changes to the error terms. PROC MIXED is a superior method for these cases.

In order to use PROC MIXED, the covariance must be estimated in some way. If the investigator has no knowledge of how the input random effects correlate, the unstructured matrix (TYPE=UN) is the safest choice because it imposes no pattern; note that it is not the procedure default, which is variance components (TYPE=VC). PROC MIXED has a number of other possible covariance matrix designs, but they should be used only if the user has a good idea of the structure of the matrix. Standard code is

   PROC MIXED DATA=WORK.SORT5396 METHOD=REML;
      CLASS CourseLevel Applied Statistics;
      MODEL hours_modified = Applied CourseLevel Statistics /
            HTYPE=3 DDFM=CONTAIN
            OUTPM=WORK._PRE6476(LABEL="Predicted means..")
            OUTP=WORK._PRE937(LABEL="Predicted values ");
      RANDOM CourseLevel / G TYPE=VC;
      LSMEANS Applied CourseLevel Statistics / PDIFF=ALL;
   RUN;

PROC GENMOD

PROC GENMOD generalizes PROC LOGISTIC by allowing for more than binary outcomes. For the general linear model (GLM), the model equation takes the form $Y = \alpha + \beta X + \varepsilon$, so that the estimate is $\hat{y} = X\beta$. The residual error, $\varepsilon$, is assumed normally distributed with mean zero and constant variance. For the generalized linear model, the estimate changes to $g(\hat{y}) = X\beta$, where $g$ is called a link function. If $g(\hat{y}) = \log\left(\frac{\hat{y}}{1-\hat{y}}\right)$ and the outcome is binary, then the model is the special case of logistic regression and PROC LOGISTIC can be used. If the outcome variable consists of count data, then the link function $g(\hat{y}) = \log(\hat{y})$ can be used. The assumption here is that the residuals have a Poisson distribution. However, this same link function can be used under the assumption that the residuals are interval data. In this case, the residuals are assumed to form a gamma distribution, which also includes the special case of the exponential distribution. There are a number of other distributions that can be used as well. The problem is that the residual distribution of $g(\hat{y}) = X\beta$ depends upon the model, and that model depends upon the choice of the link function. Possible link functions are given in Table 2.

Table 2. Examples of Link Functions in PROC GENMOD

   Outcome                     Distribution   Link Function
   Binary                      Binomial       Logit
   Binary (rare occurrence)    Poisson        Natural log
   Ordinal                     Multinomial    Complementary log-log
   Count                       Poisson        Natural log
   Continuous                  Normal         Identity
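To make the count-data row of Table 2 concrete, here is a minimal sketch of a Poisson model with a log link in PROC GENMOD. The dataset work.events, the variables group, patients, and events, and the data values are all hypothetical. The OFFSET= option is used on the assumption that counts should be compared per patient at risk.

```sas
/* Hypothetical data: event counts and numbers of patients at risk, by group */
data work.events;
   input group $ patients events;
   log_patients = log(patients);   /* offset so that rates, not raw counts, are modeled */
   datalines;
A 1200 14
A  950  9
A 1100 12
B 1430 21
B 1250 19
B 1010 17
;
run;

proc genmod data=work.events;
   class group;
   /* Poisson distribution with its natural log link, as in Table 2 */
   model events = group / dist=poisson link=log offset=log_patients type3;
run;
```

Replacing DIST=POISSON with DIST=GAMMA (and dropping the offset) while keeping LINK=LOG gives the gamma case for a positive interval outcome. The link stays the same and only the assumed distribution changes.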

If the investigator has some domain knowledge that allows him to choose a link function, that function should be used. However, if the investigator cannot estimate the function, another way is to estimate $Y = \alpha + \beta X$ first using PROC GLM while saving the residuals in a dataset. The residuals can then be used in PROC KDE to estimate the form of the distribution. The investigator can then choose the link function that comes closest to the kernel distribution. The kernel can be examined using the code listed below. Figure 2 gives an example kernel density estimator.

   proc kde data=sasuser.qury0181;
      univar hours / gridl=0 gridu=25 out=sasuser.kdehours;
   run;

   PROC GPLOT DATA=sasuser.kdehours;
      PLOT density * value / VAXIS=AXIS1 HAXIS=AXIS2 FRAME;
   RUN;
   QUIT;

Figure 2. Results of PROC KDE

Standard code for PROC GENMOD is given below:

   PROC GENMOD DATA=WORK.SORT4864;
      CLASS Applied Statistics workhabits;
      MODEL hours = Applied Statistics workhabits /
            LINK=LOG DIST=GAMMA TYPE3 CORRB LRCI CL ALPHA=0.05;
      LSMEANS Applied Statistics workhabits / ALPHA=0.05;
      OUTPUT OUT=WORK.TEMP6816 PREDICTED=_predicted1 RESDEV=_resdev1 RESCHI=_reschi1;
   RUN;
   QUIT;

PROC GLIMMIX

This procedure generalizes the MIXED procedure to include error terms that are not normally distributed. It also generalizes the GENMOD procedure to allow for random effects in the model. However, the random effects must be normal. The general format for GLIMMIX is

   proc glimmix;
      class block a b;
      model y = a b a*b / ddf=#;
      random block a*block;
      lsmeans a b a*b / diff;
   run;

Unlike PROC MIXED, PROC GLIMMIX does not have a REPEATED statement; repeated measures are specified in the RANDOM statement. Possible link functions are given in Table 3.

Table 3. Link Functions for PROC GLIMMIX

   Outcome (DIST=)   Distribution        Link Function
   Beta              Beta                Logit
   Binary            Binary              Logit
   Binomial          Binomial            Logit
   Exponential       Exponential         Log
   Gamma             Gamma               Log
   Gaussian          Normal              Identity
   Geometric         Geometric           Log
   Invgauss          Inverse Gaussian    Inverse squared
   Lognormal         Log-normal          Identity
   Multinomial       Multinomial         Cumulative logit
   Negbinomial       Negative binomial   Log
   Poisson           Poisson             Log
   Tcentral          t                   Identity

Sample code is given below:

   PROC GLIMMIX DATA=sasuser.qury0181;
      CLASS CourseLevel Applied Statistics;
      MODEL hours_modified = Applied CourseLevel Statistics /
            HTYPE=3 DDFM=CONTAIN DIST=GAMMA;
      RANDOM CourseLevel / G TYPE=VC;
      LSMEANS Applied CourseLevel Statistics / PDIFF=ALL;
   RUN;
   QUIT;

EXAMPLES

Consider the following examples:

1. A test to compare the effectiveness of CT scans to x-ray in the detection of lung cancer. Each patient is randomized to receive x-ray only or CT only. 10,000 patients are in the sample, limited to high-risk patients. The outcome variable is the occurrence of lung cancer.
2. A randomized clinical trial to compare treatment of osteomyelitis (MRSA) with vancomycin and Zyvox. Patients are treated according to protocol, with follow-up at 1, 2, 6, and 12 months after the end of treatment.

What if the study is observational rather than randomized?

In the first example, the occurrence of lung cancer is rare. Therefore, a Poisson distribution would better fit the study than a logistic regression. In the second, the measure of recurrence is a repeated measure. While it can also be examined using survival analysis, the fact that measurements are taken at fixed intervals rather than continuously will also allow for a mixed models design.

CONCLUSION

While it is possible to use PROC GLIMMIX, the most complex of the models, for every analysis, it is not advisable. Even with GLIMMIX, choices as to random versus fixed effects, link function, and covariance matrix still have to be made. Therefore, the investigator should use the simplest procedure that will accommodate the variable choices.

CONTACT

Patricia Cerrito
University of Louisville
Department of Mathematics
Louisville, KY
pcerrito@louisville.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.


More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

More information

End User Satisfaction With a Food Manufacturing ERP

End User Satisfaction With a Food Manufacturing ERP Applied Mathematical Sciences, Vol. 8, 2014, no. 24, 1187-1192 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2014.4284 End-User Satisfaction in ERP System: Application of Logit Modeling Hashem

More information

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

More information

Assessing Model Fit and Finding a Fit Model

Assessing Model Fit and Finding a Fit Model Paper 214-29 Assessing Model Fit and Finding a Fit Model Pippa Simpson, University of Arkansas for Medical Sciences, Little Rock, AR Robert Hamer, University of North Carolina, Chapel Hill, NC ChanHee

More information

Binary Logistic Regression

Binary Logistic Regression Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

Chapter 3 Quantitative Demand Analysis

Chapter 3 Quantitative Demand Analysis Managerial Economics & Business Strategy Chapter 3 uantitative Demand Analysis McGraw-Hill/Irwin Copyright 2010 by the McGraw-Hill Companies, Inc. All rights reserved. Overview I. The Elasticity Concept

More information

Applied Regression Analysis and Other Multivariable Methods

Applied Regression Analysis and Other Multivariable Methods THIRD EDITION Applied Regression Analysis and Other Multivariable Methods David G. Kleinbaum Emory University Lawrence L. Kupper University of North Carolina, Chapel Hill Keith E. Muller University of

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities.

Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities. Consider a study in which How many subjects? The importance of sample size calculations Office of Research Protections Brown Bag Series KB Boomer, Ph.D. Director, boomer@stat.psu.edu A researcher conducts

More information

Lean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY

Lean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY Before we begin: Turn on the sound on your computer. There is audio to accompany this presentation. Audio will accompany most of the online

More information

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013 Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

Offset Techniques for Predictive Modeling for Insurance

Offset Techniques for Predictive Modeling for Insurance Offset Techniques for Predictive Modeling for Insurance Matthew Flynn, Ph.D, ISO Innovative Analytics, W. Hartford CT Jun Yan, Ph.D, Deloitte & Touche LLP, Hartford CT ABSTRACT This paper presents the

More information

Introduction to proc glm

Introduction to proc glm Lab 7: Proc GLM and one-way ANOVA STT 422: Summer, 2004 Vince Melfi SAS has several procedures for analysis of variance models, including proc anova, proc glm, proc varcomp, and proc mixed. We mainly will

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

More information

Missing data and net survival analysis Bernard Rachet

Missing data and net survival analysis Bernard Rachet Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

List of Examples. Examples 319

List of Examples. Examples 319 Examples 319 List of Examples DiMaggio and Mantle. 6 Weed seeds. 6, 23, 37, 38 Vole reproduction. 7, 24, 37 Wooly bear caterpillar cocoons. 7 Homophone confusion and Alzheimer s disease. 8 Gear tooth strength.

More information

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. A General Formulation 3. Truncated Normal Hurdle Model 4. Lognormal

More information

Concepts of Experimental Design

Concepts of Experimental Design Design Institute for Six Sigma A SAS White Paper Table of Contents Introduction...1 Basic Concepts... 1 Designing an Experiment... 2 Write Down Research Problem and Questions... 2 Define Population...

More information

Poisson Regression or Regression of Counts (& Rates)

Poisson Regression or Regression of Counts (& Rates) Poisson Regression or Regression of (& Rates) Carolyn J. Anderson Department of Educational Psychology University of Illinois at Urbana-Champaign Generalized Linear Models Slide 1 of 51 Outline Outline

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

More information

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

7 Generalized Estimating Equations

7 Generalized Estimating Equations Chapter 7 The procedure extends the generalized linear model to allow for analysis of repeated measurements or other correlated observations, such as clustered data. Example. Public health of cials can

More information

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material

More information

HLM software has been one of the leading statistical packages for hierarchical

HLM software has been one of the leading statistical packages for hierarchical Introductory Guide to HLM With HLM 7 Software 3 G. David Garson HLM software has been one of the leading statistical packages for hierarchical linear modeling due to the pioneering work of Stephen Raudenbush

More information

Paper PO06. Randomization in Clinical Trial Studies

Paper PO06. Randomization in Clinical Trial Studies Paper PO06 Randomization in Clinical Trial Studies David Shen, WCI, Inc. Zaizai Lu, AstraZeneca Pharmaceuticals ABSTRACT Randomization is of central importance in clinical trials. It prevents selection

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

Analysis of Variance. MINITAB User s Guide 2 3-1

Analysis of Variance. MINITAB User s Guide 2 3-1 3 Analysis of Variance Analysis of Variance Overview, 3-2 One-Way Analysis of Variance, 3-5 Two-Way Analysis of Variance, 3-11 Analysis of Means, 3-13 Overview of Balanced ANOVA and GLM, 3-18 Balanced

More information

Section 13, Part 1 ANOVA. Analysis Of Variance

Section 13, Part 1 ANOVA. Analysis Of Variance Section 13, Part 1 ANOVA Analysis Of Variance Course Overview So far in this course we ve covered: Descriptive statistics Summary statistics Tables and Graphs Probability Probability Rules Probability

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS About Omega Statistics Private practice consultancy based in Southern California, Medical and Clinical

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Directions for using SPSS

Directions for using SPSS Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...

More information

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC 27709 Abstract

More information

Statistics and Pharmacokinetics in Clinical Pharmacology Studies

Statistics and Pharmacokinetics in Clinical Pharmacology Studies Paper ST03 Statistics and Pharmacokinetics in Clinical Pharmacology Studies ABSTRACT Amy Newlands, GlaxoSmithKline, Greenford UK The aim of this presentation is to show how we use statistics and pharmacokinetics

More information

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

More information

Development Period 1 2 3 4 5 6 7 8 9 Observed Payments

Development Period 1 2 3 4 5 6 7 8 9 Observed Payments Pricing and reserving in the general insurance industry Solutions developed in The SAS System John Hansen & Christian Larsen, Larsen & Partners Ltd 1. Introduction The two business solutions presented

More information

Statistical Functions in Excel

Statistical Functions in Excel Statistical Functions in Excel There are many statistical functions in Excel. Moreover, there are other functions that are not specified as statistical functions that are helpful in some statistical analyses.

More information

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means Lesson : Comparison of Population Means Part c: Comparison of Two- Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics. Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

More information